Matlab/Distributed computations using Torque
In the following text we distinguish the master instance of MATLAB, which runs on a single CPU and controls the computation (it requires the Distrib_Computing_Toolbox licence), from the slave workers (each needing a MATLAB_Distrib_Comp_Engine licence) that execute the individual tasks of all jobs. MATLAB uses the following terms:
- A distributed (simple) job consists of tasks that do not communicate with each other directly and do not need to run simultaneously. One worker might run several tasks of the same job in succession.
- Parallel jobs are those in which the workers (or labs) can communicate with each other during the evaluation of their tasks. A special case of the parallel job is the so-called MatlabPool job, which requires the internal MATLAB scheduler (job manager); this is currently not supported in MetaCentrum.
For simplicity, let the master MATLAB instance run in interactive mode:
$ qsub -q short -I -l nodes=1:ppn=1,matlab=1,matlab_Distrib_Computing_Toolbox=1
Later, once the parallel execution has been successfully tested, it is possible to put all the commands below into a single file, say master.m, and run it as a batch job.
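A minimal sketch of what such a batch-mode script might look like (the file name master.m and the save step are illustrative, not prescribed by this guide; the scheduler settings are the ones configured below):

```matlab
% master.m -- sketch of a non-interactive driver script
sched = findResource('scheduler','type','torque');
set(sched,'DataLocation',['/home/storage/' getenv('LOGNAME') '/matlab']);
% ... remaining scheduler settings as configured below ...

job = createJob(sched);
createTask(job, @rand, 1, {10,10});
submit(job);
waitForState(job);

results = getAllOutputArguments(job);
save('results.mat','results');   % keep the results before MATLAB exits
exit;
```

Such a script can then be started inside a submitted job, e.g. with matlab -nodisplay -r master.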
On the obtained machine we start MATLAB and enter the following commands:
>> sched = findResource('scheduler','type','torque');
>> set(sched,'ClusterOsType', 'unix');
>> set(sched,'DataLocation',['/home/storage/' getenv('LOGNAME') '/matlab']);
>> set(sched,'HasSharedFilesystem', true);
>> set(sched,'ClusterMatlabRoot',matlabroot); % MATLAB is installed in the same directory on all MetaCentrum machines
>> set(sched,'SubmitArguments','-q short');
>> set(sched,'ResourceTemplate','-l nodes=^N^:ppn=1,matlab=1,matlab_MATLAB_Distrib_Comp_Engine=1');
>> get(sched)
The value of DataLocation specifies a directory where MATLAB saves all the data that need to be transferred to the workers, as well as results and log files. The parameters SubmitArguments and ResourceTemplate contain parameters for qsub, which is called internally by MATLAB to run the individual tasks. The literal ^N^ is automatically replaced by MATLAB with the number of workers our job needs; we also ask for the MATLAB_Distrib_Comp_Engine licence (each job requires ^N^ licences). The DataLocation directory and other parameters such as ResourceTemplate (specifying the cluster, city, etc.) have to be set consistently, i.e. the directory has to exist on all potential worker nodes where the tasks may be executed.
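Because the DataLocation directory has to exist before the workers use it, it may be worth creating it from MATLAB before configuring the scheduler; a small sketch using the same path as above:

```matlab
% make sure the shared DataLocation directory exists
dataDir = ['/home/storage/' getenv('LOGNAME') '/matlab'];
if ~exist(dataDir, 'dir')
    mkdir(dataDir);   % create it if it is not there yet
end
```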
Let us create a simple test job; you may look at it as a container for tasks that will run in parallel independently (or even communicate with each other). Through the job reference we also retrieve the results of all the tasks.
Example 1: simpleJob
>> job = createJob(sched);
>> createTask(job, @rand, 1, {10,10}); % first random matrix 10x10
>> createTask(job, @rand, 1, {10,10}); % second random matrix 10x10
>> get(job)
We created two tasks; both will run independently, each processed by its own worker.
>> submit(job); % MATLAB internally calls qsub
>> waitForState(job); % waiting for tasks to finish
>> results = getAllOutputArguments(job); % collecting data from each task
>> celldisp(results); % display results
After the job is submitted, its tasks become regular TORQUE jobs whose state can be checked with 'qstat' or online in the Personal view. The waitForState call blocks until all tasks finish. At this point it is also possible to quit the master MATLAB instance and collect the resulting data later from the DataLocation directory.
If all tasks finished successfully and we have saved our results, it is possible to remove all files from the DataLocation:
>> finished_jobs = findJob(sched, 'State', 'finished');
>> destroy(finished_jobs); % clean up
>> clear finished_jobs
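As noted above, the master MATLAB instance may be closed while the tasks run; a hedged sketch of how one might later reconnect to the scheduler in a fresh MATLAB session and fetch finished jobs from the same DataLocation:

```matlab
% fresh MATLAB session, after the original master instance was closed
sched = findResource('scheduler','type','torque');
set(sched,'DataLocation',['/home/storage/' getenv('LOGNAME') '/matlab']);

finished = findJob(sched, 'State', 'finished');  % jobs recovered from DataLocation
results  = getAllOutputArguments(finished(1));   % results of the first finished job
celldisp(results);
```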
Example 2: parallelJob
>> pjob = createParallelJob(sched);
>> createTask(pjob, @rand, 1, {10,10}); % random matrix 10x10
>> set(pjob,'MinimumNumberOfWorkers',3);
>> set(pjob,'MaximumNumberOfWorkers',3);
>> get(pjob)
We created one task that will run on three workers.
>> submit(pjob); % MATLAB internally calls qsub
>> waitForState(pjob); % wait for all workers to finish
>> results = getAllOutputArguments(pjob); % gather data from all workers
>> celldisp(results); % display results
>> destroy(pjob); % clean up
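In a parallel job every worker evaluates the same task function and the workers can communicate during evaluation; as an illustrative sketch of this (not taken from the text above), the example below lets each of the three workers report its own index via labindex and the total number of labs via numlabs:

```matlab
pjob = createParallelJob(sched);
set(pjob,'MinimumNumberOfWorkers',3);
set(pjob,'MaximumNumberOfWorkers',3);

% every worker ("lab") runs the same function; labindex identifies it
createTask(pjob, @() [labindex numlabs], 1, {});

submit(pjob);
waitForState(pjob);
celldisp(getAllOutputArguments(pjob));  % one row per worker, e.g. [1 3], [2 3], [3 3]
destroy(pjob);                          % clean up
```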