About scheduling system
Metacentrum wiki is deprecated after March 2023
Dear users, due to the integration of MetaCentrum into https://www.e-infra.cz/en (e-INFRA CZ service), the documentation for users will change format and site. The current wiki pages won't be updated after the end of March 2023. They will, however, be kept for a few months for reference. The new documentation resides at https://docs.metacentrum.cz.
The batch job system PBS (Portable Batch System) allows interactive and non-interactive jobs with different requirements to be run in such a way that shared resources are used rationally and fairly. It prevents monopolization of resources by specific jobs or specific users. This document describes PBS usage in the MetaCentrum environment.
Basic concepts - jobs, queues, resources
The basic term of the PBS system from the user's point of view is a computational job. A job can be batch (pre-prepared, non-interactive) or interactive, and it can use one or many processors.
After the job is submitted, PBS returns a job identifier that is used to track and manipulate the job during its lifecycle. The qsub command also checks whether the user has valid Kerberos tickets (and PBS guarantees that the job will have the tickets during its execution). The program output is saved in the folder from which the job was submitted.
Jobs are inserted into so-called queues of the PBS server using the command qsub, where they wait for a situation suitable for their run, especially from the system-load point of view. Freely available queues differ in the maximal allowed duration of a job. Other queues are dedicated to special projects or user groups (ncbr, iti, gridlab, ...) or to special types of jobs (gpu, ...). Some queues are available only to listed users or groups.
Computational nodes have limited resources, such as memory, CPUs, job duration or disk space; a resource can also be a licence for specific software or the physical location of the node. A job can run only after all resources required by the job are free. When any resource (most typically the duration of the job) is exceeded, the job is forcefully terminated by PBS.
Note: Every single job requires some resources on the scheduler side. In the case of very short jobs, the planning may take longer than the job itself. Therefore, if you need to submit many (more than a thousand) short (less than 10 minutes) jobs, we strongly recommend running them in batches submitted as a single job. To prevent overloading of the PBS server, there is a quota of 10 000 jobs (running or queuing) per user.
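For illustration, a batch of short tasks can be packed into one job simply by looping over them inside the job script; the task command and count below are only illustrative:
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=1gb:scratch_local=1gb
#PBS -l walltime=2:00:00
# run 100 short tasks one after another inside a single job instead of submitting 100 separate jobs
for i in $(seq 1 100); do
    ./short_task "$i"     # illustrative command, replace with your own
done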
Useful links
Computational nodes have properties describing their architecture, operating system, special networking equipment or special access restrictions. The properties are listed on the MetaCentrum web.
You can use the qsub assembler tool (Command qsub refining) to define these conditions. The current state of queues, clusters and jobs is also clearly available on the MetaCentrum web in the Current state section.
Types of queues
The current list of queues, including their settings, can be found in the Pbsmon application or via the qstat -q or qstat -Q commands. Queues with limited access are marked with a lock icon. Below is a list of the most used queues.
Jobs are sorted into queues according to the mandatory specified maximal running time of the job (2h, 4h, 1 day, 2 days, 4 days, 1 week, 2 weeks and more than 2 weeks).
The queue backfill is designed for "filling" jobs. It has a low priority, its jobs can be killed when needed and only one machine per job is allowed, but there can be many such jobs.
The queue preemptible allows users to use owner-reserved clusters where no common queues are available. Cluster owners are able to suspend other people's running jobs and run their own jobs immediately. Jobs can be suspended for up to 30 days.
PBS basic commands
- qsub – submit the job to the queue
- qdel – cancel a waiting or running job
- qmove – move the job to another queue (only waiting jobs!)
- pbsnodes – get current node state and its properties
- qstat – current state of jobs
The volume of displayed information can be adjusted by -v and -vv parameters.
qstat options
qsub options
The command qsub serves for submitting a job into a queue. The common syntax of the qsub command is as follows (-l is a lowercase "L" and introduces a resource request):
qsub [-q queue] -l resource=value [-l resource2=value2] ... [-l resourceN=valueN] script
The switch -q specifies the queue into which the job is enqueued. If the -q option is omitted, the queue "default" will be used.
The argument script has to be the name of a file containing a shell script (by default a /bin/sh script) that will be interpreted during job execution. Options of the qsub command can be embedded in comments of this script (lines beginning with "#PBS"). If the script argument is missing, qsub reads the script from standard input.
The above general qsub command can look, for example, like this:
qsub -l select=1:ncpus=1:mem=1gb:scratch_local=10gb -l walltime=1:00:00 script.sh
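For illustration, script.sh can be as simple as the following sketch (the commands inside are only placeholders):
#!/bin/bash
# minimal illustrative batch script
echo "Job running on $(hostname)"
sleep 60       # placeholder for the real computation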
How to set number of nodes and processors
The number of processors and "chunks" is set with -l select=[number]:ncpus=[number]. The terminology of PBS Pro defines a "chunk" as a further indivisible set of resources allocated to a job on one physical node. Chunks can be placed next to each other on one machine, always on different machines, or arbitrarily according to available resources. Note that only one select argument is allowed at a time. Examples:
- -l select=1:ncpus=2 – two processors on one chunk
- -l select=2:ncpus=1 – two chunks each with one processor
- -l select=1:ncpus=1+1:ncpus=2 – two chunks, one with one processor and second with two processors
- -l select=2:ncpus=1 -l place=pack – all chunks must be on one node (if there is not any big enough node, the job will never run)
- -l select=2:ncpus=1 -l place=scatter – each chunk will be placed on different node
- -l select=2:ncpus=1 -l place=free – permission to place chunks on nodes arbitrarily, according to the actual resource availability on the nodes (default behaviour for PBS Pro)
If you are not sure about the number of needed processors, ask for an exclusive reservation of the whole machine using the parameter "-l place=":
- -l select=2:ncpus=1 -l place=exclhost – request for 2 exclusive nodes (without cpu and mem limit control)
- -l select=3:ncpus=1 -l place=scatter:excl – it is possible to combine exclusivity with specification of chunk planning
- -l select=102:ncpus=1 -l place=group=cluster – 102 CPUs on a single cluster
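For example, a complete submission combining chunk and placement specifications might look like this (the resource sizes and script name are only illustrative):
qsub -l select=2:ncpus=4:mem=8gb:scratch_local=20gb -l walltime=12:00:00 -l place=scatter my_job.sh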
How to set the size of scratch
The scratch directory is disk space on the computational node used to store temporary files. Always specify the type and size of scratch; PBS assigns no scratch by default. The scratch type can be one of scratch_local, scratch_ssd or scratch_shared. Examples:
- -l select=1:ncpus=1:mem=4gb:scratch_local=10gb
- -l select=1:ncpus=1:mem=4gb:scratch_ssd=1gb
- -l select=1:ncpus=1:mem=4gb:scratch_shared=1gb
After a scratch request is specified, the following variables are set in the job environment:
- $SCRATCH_VOLUME = size of the scratch
- $SCRATCHDIR = path to the scratch directory
- $SCRATCH_TYPE = one of scratch_local, scratch_ssd, scratch_shared
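A typical job script then copies its input into the scratch directory, works there and cleans up at the end; below is a minimal sketch (the data paths and program name are only illustrative):
#!/bin/bash
DATADIR=/storage/brno1/home/$USER/mydata      # illustrative input/output location
cp $DATADIR/input.txt $SCRATCHDIR/ || exit 1  # stage input data into scratch
cd $SCRATCHDIR || exit 1
$DATADIR/my_program input.txt > output.txt    # illustrative computation
cp output.txt $DATADIR/ || exit 2             # copy the results back
rm -rf $SCRATCHDIR/*                          # clean the scratch before the job ends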
How to set the memory
Amount of needed memory – a job is implicitly assigned 400 MB of memory unless specified otherwise. Examples:
- -l select=1:ncpus=1:mem=10gb
- -l select=1:ncpus=1:mem=200mb
How to set the duration of a job (walltime)
The maximal duration of a job is set by -l walltime=[[hh:]mm:]ss; the default walltime is 24:00:00. The queues q_* (such as q_2h, q_2d, etc.) are not accessible for direct job submission; the routing queue (default) automatically chooses the appropriate time queue based on the specified walltime. Examples:
- -l walltime=1:00:00 (one hour)
- -l walltime=120:00:00 (5 days)
How to reserve a licence
Some software requires a licence. A licence is requested with the -l parameter:
- -l select=3:ncpus=1 -l walltime=1:00:00 -l matlab=1 – one licence for Matlab
How to setup email notification about job state
The PBS server sends email notifications about job state changes according to the -m option:
- -m a send mail when job is aborted by batch system
- -m b send mail when job begins execution
- -m e send mail when job ends execution
- -m n do not send mail
The options a, b, e can be combined, e.g.:
- -m abe – sends an email when the job aborts (a), begins (b) and completes/ends (e)
The email can be sent to any email address using the -M option:
- -M james@pbspro.com
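A complete example combining these options (the address and script name are only illustrative):
qsub -m abe -M user@example.com -l select=1:ncpus=1 -l walltime=1:00:00 script.sh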
How to save the output elsewhere
By default the job output (the standard output and error files) is saved in the folder from which the job was submitted (variable PBS_O_WORKDIR).
This behaviour can be changed with the -o parameter for the output file and the -e parameter for the error file:
- -o /custom-path/myOutputFile
- -e /custom-path/myErrorFile
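For example (the paths are only illustrative):
qsub -o /storage/brno1/home/$USER/job.out -e /storage/brno1/home/$USER/job.err -l select=1:ncpus=1 script.sh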
How to choose specific queue or PBS server
If you need to send the job to a specific queue and/or specific PBS server, use the qsub -q destination option.
The argument destination can be one of the following:
- queue@server # specific queue on a specific server
- queue # specific queue on the current (default) server
- @server # default queue on a specific server
E.g. qsub -q oven@meta-pbs.metacentrum.cz will send the job to the queue oven on the server meta-pbs.metacentrum.cz. Similarly, qsub -q @cerit-pbs.cerit-sc.cz will send the job to the default queue managed by the CERIT PBS server, no matter which frontend the job is sent from.
How to submit a job on nodes with a particular OS
To submit a job to a machine with a particular OS, use e.g. "os=debian9" or "os=centos7" in the job specification:
zuphux$ qsub -l select=1:ncpus=2:mem=1gb:scratch_local=1gb:os=debian9 …
To submit a job to a machine with any Debian version, use "osfamily=debian" (or "osfamily=redhat" for RHEL or CentOS) in the job specification:
zuphux$ qsub -l select=1:ncpus=2:mem=1gb:scratch_local=1gb:osfamily=debian …
To run jobs on a machine with any OS, use "os=^any":
zuphux$ qsub -l select=1:ncpus=2:mem=1gb:scratch_local=1gb:os=^any …
If you experience any library or application compatibility problems on Debian9, add the module debian8-compat.
How to choose/avoid a particular cluster
The PBS allows you to choose a particular cluster:
qsub -l select=X:ncpus=X:mem=Xmb:scratch_local=X:cluster=ida
The PBS allows also to avoid a particular cluster:
qsub -l select=X:ncpus=X:mem=Xmb:scratch_local=X:cluster=^ida
However, it is not possible to combine such conditions. If you want, for example, to avoid both ida and haldir, the following
qsub -l select=X:ncpus=X:mem=Xmb:scratch_local=X:cluster=^ida:cluster=^haldir
will not work. This is based on the principle that in PBS, every resource (in this case the cluster resource) can be specified only once.
On the other hand, cl_ida and cl_haldir are different resources, so:
qsub -l select=X:ncpus=X:mem=Xmb:scratch_local=X:cl_ida=False:cl_haldir=False
will work and will avoid both ida and haldir clusters.
The same effect can be achieved with
qsub -l select=X:ncpus=X:mem=Xmb:scratch_local=X:cluster=^ida:cl_haldir=False
Note: As the list of node properties is made by hand, it is always possible that for some cluster the resource cl_SOMETHING will be missing. Therefore:
- always check a list of resources in https://metavo.metacentrum.cz/pbsmon2/props#prop2node
- write to us at meta@cesnet.cz if you think some cl_SOMETHING is missing from the list. Thank you!
Resources of computational nodes
The provided list of attributes may not be complete. You can find the current list on the web in the Node properties section.
node with a specific feature – the value of a feature must always be specified (either True or False). Examples:
- -l select=1:ncpus=1:cluster=tarkil – request for a node from cluster tarkil
- -l select=1:ncpus=1:cluster=^tarkil – request for a node except cluster tarkil
request for a specific node – always use the shortened name. Example:
- -l select=1:ncpus=1:vnode=tarkil3 – request for node tarkil3.metacentrum.cz
exclude a specific node – always use the shortened name. Example:
- -l select=1:ncpus=1:vnode=^elmo3-1 – exclude node elmo3-1.metacentrum.cz
request for a specific host – use the full host name. Example:
- -l select=1:ncpus=1:host=tarkil3.grid.cesnet.cz
cgroups – request limiting of memory usage by cgroups; memory limiting by cgroups is not enabled on all machines. Example:
- -l select=1:ncpus=1:mem=5gb:cgroups=memory
cgroups – request limiting of CPU usage by cgroups; CPU limiting by cgroups is not enabled on all machines. Example:
- -l select=1:ncpus=1:mem=5gb:cgroups=cpuacct
networking cards – "-l place" is also used to request InfiniBand:
- -l select=3:ncpus=1 -l walltime=1:00:00 -l place=group=infiniband
CPU flags – limit submission to nodes with specific CPU flags. Example:
- -l select=1:ncpus=1:cpu_flag=sse3
- The list of available flags can be obtained with the command pbsnodes -a | grep resources_available.cpu_flag | awk '{print $3}' | tr ',' '\n' | sort | uniq – this list changes whenever nodes are added or removed, so it is wise to check the available flags before requesting anything special.
multi-CPU job on the same cluster
- qsub -l place=group=cluster
select nodes in a specific location – e.g. to reserve 3 nodes in Pilsen, 1 processor on each node:
- qsub -l select=3:ncpus=1:plzen=True
Moving job to another queue
qmove uv@cerit-pbs.cerit-sc.cz 475337.cerit-pbs.cerit-sc.cz # move job 475337.cerit-pbs.cerit-sc.cz to a queue uv@cerit-pbs.cerit-sc.cz
Jobs can only be moved from one server to another if they are in the 'Q', 'H', or 'W' states, and only if there are no running subjobs. A job in the Running (R), Transiting (T), or Exiting (E) state cannot be moved. See list of queues.
GPU computing
For computing on GPUs a gpu queue is used (either gpu or gpu_long can be specified). GPU queues are accessible to all MetaCentrum members; one GPU card is assigned by default. IDs of the assigned GPU cards are stored in the CUDA_VISIBLE_DEVICES variable.
- -l select=ncpus=1:ngpus=2 -q gpu
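A complete GPU job submission might then look like this (the resource sizes and script name are only illustrative):
qsub -q gpu -l select=1:ncpus=1:ngpus=1:mem=8gb:scratch_local=10gb -l walltime=12:00:00 script.sh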
CPU speed
To require a minimal CPU speed, use the parameter spec, e.g.
- qsub -l select=1:ncpus=1:spec=4.8
will limit your selection to computational nodes with a CPU speed rated 4.8 (as scaled by SPEC CPU2017) or higher. To see which machines comply with this criterion, go to the qsub assembler and fill in only the spec parameter; you will get a table of machines matching your requirement.
Job Array
- The job array is submitted as:
# general command
$ qsub -J X-Y[:Z] script.sh
# example
$ qsub -J 2-7:2 script.sh
- X is the first index of a job, Y is the upper bound of the index and Z is an optional index step; the example command therefore generates subjobs with indexes 2, 4 and 6.
- The job array is represented by a single job whose job number is followed by "[]"; this main job provides an overview of unfinished subjobs.
$ qstat -f 969390'[]' -x | grep array_state_count
array_state_count = Queued:0 Running:0 Exiting:0 Expired:0
- An example of sub job ID is 969390[1].meta-pbs.metacentrum.cz.
- The sub job can be queried by a qstat command (qstat -t).
- PBS Pro uses PBS_ARRAY_INDEX instead of Torque's PBS_ARRAYID inside a subjob. The variable PBS_ARRAY_ID contains the job ID of the main job.
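A subjob script can use the index to pick its own input, as in the following minimal sketch (the file naming scheme and program name are only illustrative); it could be submitted e.g. as qsub -J 1-10 script.sh:
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -l walltime=1:00:00
# each subjob processes the input file selected by its array index
INPUT="input.$PBS_ARRAY_INDEX.txt"                     # illustrative naming scheme
my_program "$INPUT" > "output.$PBS_ARRAY_INDEX.txt"    # illustrative program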
Parallelization
- The number of MPI processes run in one chunk is specified by mpiprocs=[number].
- For each MPI process there is one line in the nodefile $PBS_NODEFILE that specifies the allocated vnode.
- -l select=3:ncpus=2:mpiprocs=2 – 6 MPI processes (the nodefile contains 6 lines with vnode names); 2 MPI processes always share 1 vnode with 2 CPUs
- The number of OpenMP threads run in one chunk is set by ompthreads=[number]; the default behaviour is ompthreads = ncpus (i.e. 2 OpenMP threads per chunk in the example above)
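Inside the job script an MPI program is then typically started across the allocated vnodes listed in $PBS_NODEFILE; below is a minimal sketch (the module and program names are only illustrative):
#!/bin/bash
#PBS -l select=3:ncpus=2:mpiprocs=2:mem=4gb
#PBS -l walltime=2:00:00
module add openmpi                              # illustrative module name
# start 6 MPI processes, one per line of the nodefile
mpirun -machinefile $PBS_NODEFILE ./my_mpi_program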
Default working directory setting in shell
For a shell in tmux, add the following command to the .bashrc in your /home directory (replace LOGNAME with your user name):
case "$-" in *i*) cd /storage/brno1/home/LOGNAME/ ;; esac
For the /home directory:
case "$-" in *i*) cd ;; esac
Multiprocessor task
Multiprocessor tasks usually use commercial software which is already prepared to use multiple processors (see, for example, a Gaussian task). Nevertheless, you can also create a multiprocessor task on your own. You must specify the maximum duration and the number of processors when submitting the job, for example a job lasting at most 2 hours on 3 machines with 2 processors on each:
qsub -l walltime=2:0:0 -l select=3:ncpus=2 file.sh
Your requests can also be specified directly in the executed file file.sh using lines which start with #PBS, for example:
#!/bin/sh
#PBS -N myJobName
#PBS -l walltime=2:0:0
#PBS -l select=3:ncpus=2
#PBS -j oe
#PBS -m ae
#
# description from 'man qsub':
# -N ... declares a name for the job. The name specified may be up to and including 15 characters in length. It
#        must consist of printable, non white space characters with the first character alphabetic.
# -q ... defines the destination of the job (queue)
# -l ... defines the resources that are required by the job
# -j oe ... standard error stream of the job will be merged with the standard output stream
# -m ae ... mail is sent when the job aborts or terminates
and then the job can be submitted by typing
qsub file.sh
After that the script is executed on one of the allocated machines. The list of allocated machines can be found in the file whose name is stored in the system variable $PBS_NODEFILE. Further processes can be executed on the other allocated machines with the pbsdsh command. (For more details see the man pages: man qsub and man pbsdsh.) For example:
#!/bin/sh
#PBS -l select=3:ncpus=2
#PBS -j oe
#PBS -m e
#
# the file whose name is stored in the variable PBS_NODEFILE contains the list of allocated machines
echo '***PBS_NODEFILE***START*******'
cat $PBS_NODEFILE
echo '***PBS_NODEFILE***END*********'
# run the given command on all allocated machines
pbsdsh -- uname -a
Submit the job, wait until it completes (status C) and check the output:
$ qsub file.sh
272154.skirit-f.ics.muni.cz
$ qstat 272154.skirit-f.ics.muni.cz
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
272154.skirit-f  file.sh          makub            00:00:00 R short
$ qstat 272154.skirit-f.ics.muni.cz
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
272154.skirit-f  file.sh          makub            00:00:00 C short
$ cat file.sh.o272154
***PBS_NODEFILE***START*******
skirit4.ics.muni.cz
skirit7.ics.muni.cz
skirit8.ics.muni.cz
skirit4.ics.muni.cz
skirit7.ics.muni.cz
skirit8.ics.muni.cz
***PBS_NODEFILE***END*********
Linux skirit4.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
Linux skirit8.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
Linux skirit4.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
Linux skirit7.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
Linux skirit8.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
Linux skirit7.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
Observe that the request select=3:ncpus=2 was handled by allocating the machines skirit4, skirit7 and skirit8. Each of them appears twice because two processors were requested on each.
How to avoid the migration of jobs between PBS servers
Submitted jobs may be automatically moved to a different PBS server. This typically occurs when one PBS server is overloaded (and new jobs are queueing) while another PBS server has free resources. In some cases this automatic transfer may be undesirable (e.g. because of differences in Debian versions) and should be prevented. Limiting a job to a specific PBS server can be done with the pbs_server resource.
pbs_server=<full_pbs_server_name>
For example,
qsub -l select=ncpus=1:pbs_server=meta-pbs.metacentrum.cz ...
will limit the job to nodes managed by the MetaCentrum PBS server.