Parallelization

Metacentrum wiki is deprecated after March 2023
Dear users, due to the integration of MetaCentrum into e-INFRA CZ (https://www.e-infra.cz/en), the user documentation is changing format and site.
These wiki pages will not be updated after the end of March 2023. They will, however, be kept for a few months for reference.
The new documentation resides at https://docs.metacentrum.cz.
Related topics

  • Applications with MPI

  • PBS options for MPI processes

Parallel computing can significantly shorten the runtime of your job because the job uses multiple resources at once. MetaCentrum offers two ways of parallel computing, OpenMP and MPI, which can be used separately or combined.

OpenMP

If your application is able to use multiple threads via shared memory, request a single chunk with multiple processors and make sure the variable OMP_NUM_THREADS is set.

Note: Setting the variable OMP_NUM_THREADS is important, as it restricts the number of threads that can run in parallel. If OMP_NUM_THREADS is not set, the application may try to use all available cores and the batch system will kill your job.

For example, with the qsub command:

qsub -l select=1:ncpus=4:ompthreads=4:mem=16gb:scratch_local=5gb -l walltime=24:00:00 script.sh

and add the following line to the batch script:

export OMP_NUM_THREADS=4 # write the number explicitly

or (a safer way):

export OMP_NUM_THREADS=$PBS_NUM_PPN # set it equal to PBS variable PBS_NUM_PPN (number of CPUs in a chunk)
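
Putting this together, a complete batch script for the request above might look like the following sketch (my_omp_app is a placeholder binary and the use of $SCRATCHDIR is an assumption; adapt both to your application):

#!/bin/bash
#PBS -l select=1:ncpus=4:ompthreads=4:mem=16gb:scratch_local=5gb
#PBS -l walltime=24:00:00

# use exactly as many threads as there are CPUs in the chunk
export OMP_NUM_THREADS=$PBS_NUM_PPN

# run from the fast local scratch directory
cd $SCRATCHDIR || exit 1

# my_omp_app is a placeholder for your OpenMP-enabled binary
~/my_omp_app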

MPI

Note: Running an MPI computation is possible via the mpirun command.

If your application consists of multiple processes communicating via a message passing interface (see applications with MPI), request a set of chunks with an arbitrary number of processors. For example:

qsub -l select=2:ncpus=2:mem=1gb:scratch_local=2gb -l walltime=1:00:00 script.sh

For most applications, it is preferable to use large chunks (many nodes with 32 or 64 CPUs (cores) are available in MetaCentrum) rather than many small chunks, since communication inside the shared memory of a single node is faster than over the external network. PBS may or may not place multiple chunks on a single node (depending on available resources and other jobs). In special cases when each chunk must be placed on a different node, use the -l place=scatter parameter.

qsub -l select=2:ncpus=2:mem=1gb:scratch_local=2gb -l place=scatter -l walltime=1:00:00 script.sh

Then run your calculation as

mpirun myMPIapp
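
A complete batch script for the two-chunk request above could look like this sketch (the openmpi module name and the myMPIapp binary are assumptions; load whichever MPI implementation your application was built with):

#!/bin/bash
#PBS -l select=2:ncpus=2:mem=1gb:scratch_local=2gb
#PBS -l walltime=1:00:00

# load an MPI implementation (assumed module name, adjust as needed)
module add openmpi

# mpirun obtains the list of allocated nodes from PBS automatically
mpirun myMPIapp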


Use InfiniBand connection

To get an even better speedup, you can request special nodes which are interconnected by a low-latency InfiniBand network.

qsub -l select=4:ncpus=4:mem=1gb:scratch_local=1gb -l walltime=1:00:00 -l place=group=infiniband script.sh
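
To check how PBS actually placed your chunks, you can list the allocated nodes from inside the job; the $PBS_NODEFILE variable (also used in the examples below) points to a file with the hostnames your job received:

# show the allocated hostnames; with place=scatter every chunk
# is guaranteed to sit on a different node
cat $PBS_NODEFILE | uniq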

MPI and OpenMP interaction

If your application supports both types of parallelization (MPI and OpenMP), you can combine them. This requires some level of caution; otherwise, the job might come into conflict with the scheduler.

PBS options for parallelization are (see the example after this list):


  • ompthreads=[number]: how many OpenMP threads can run on 1 chunk


  • mpiprocs=[number]: how many MPI processes can run on 1 chunk
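
For example, a hybrid job with 2 MPI processes, each running 4 OpenMP threads, could be requested as follows (a sketch; the memory and scratch sizes are illustrative only):

qsub -l select=2:ncpus=4:mpiprocs=1:ompthreads=4:mem=4gb:scratch_local=4gb -l walltime=1:00:00 script.sh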


Examples of correct usage of OpenMP combined with MPI:

Note: The first two examples are interchangeable; however, they can influence the calculation speed. Try both and select the faster one.

Requested resources: 1 chunk, multiple processors (one MPI process using all CPUs as OpenMP threads)

export OMP_NUM_THREADS=$PBS_NUM_PPN
mpirun -n 1 /path/to/program ...

Requested resources: 1 chunk, multiple processors (multiple MPI processes, one thread each)

export OMP_NUM_THREADS=1
mpirun /path/to/program ...

Requested resources: 2 chunks, multiple processors (one MPI process per chunk, multiple OpenMP threads)

cat $PBS_NODEFILE | uniq > nodes.txt
export OMP_NUM_THREADS=$PBS_NUM_PPN
mpirun -n 2 --hostfile nodes.txt /path/to/program ...
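
Putting the last row together, a complete batch script for the hybrid two-chunk case might look like the following sketch (/path/to/program is a placeholder, as above):

#!/bin/bash
#PBS -l select=2:ncpus=4:mpiprocs=1:ompthreads=4:mem=4gb:scratch_local=4gb
#PBS -l walltime=1:00:00

# one hostname per chunk for mpirun
cat $PBS_NODEFILE | uniq > nodes.txt

# each MPI process spawns as many OpenMP threads as there are CPUs in its chunk
export OMP_NUM_THREADS=$PBS_NUM_PPN

# start one MPI process per chunk; OpenMP threads fill the cores inside each chunk
mpirun -n 2 --hostfile nodes.txt /path/to/program ...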