About scheduling system

Metacentrum wiki is deprecated after March 2023
Dear users, due to integration of Metacentrum into https://www.e-infra.cz/en (e-INFRA CZ service), the documentation for users will change format and site.
The current wiki pages won't be updated after end of March 2023. They will, however, be kept for a few months for backwards reference.
The new documentation resides at https://docs.metacentrum.cz.

The batch job system PBS (Portable Batch System) allows interactive and non-interactive jobs with different requirements to be run in such a way that shared resources are used rationally and fairly. It prevents monopolization of resources by specific jobs or specific users. This document describes PBS usage in the MetaCentrum environment.

Basic concepts - jobs, queues, resources

The basic term of the PBS system from the user's point of view is a computational job. A job can be batch (pre-prepared, non-interactive) or interactive, and single- or multi-processor.
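An interactive job is requested with the -I switch of qsub, for example (an illustrative sketch only; the resource values are arbitrary):

 qsub -I -l select=1:ncpus=1 -l walltime=1:00:00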

After the job is submitted, PBS returns a job identifier that is used to track and manipulate the job during its lifecycle. The qsub command also checks whether the user has valid Kerberos tickets (and PBS guarantees that the job will have the tickets during its execution). Program output is saved in the folder from which the job was submitted.
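For illustration (the job identifier below is a made-up example), the returned identifier is then used with the other PBS commands described later in this document:

 $ qsub -l select=1:ncpus=1 -l walltime=1:00:00 script.sh
 12345678.meta-pbs.metacentrum.cz
 $ qstat 12345678.meta-pbs.metacentrum.cz   # check the job state
 $ qdel 12345678.meta-pbs.metacentrum.cz    # cancel the job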

Jobs are inserted into so-called queues of the PBS server using the qsub command, where they wait for a situation suitable for their run, especially with respect to system load. Freely available queues differ in the maximal allowed duration of a job. Other queues are dedicated to special projects or user groups (ncbr, iti, gridlab, ...) or to special types of jobs (gpu, ...). Some queues are available only to listed users or groups.

Computational nodes have limited resources, such as memory, CPUs, duration of the job or disk space; a resource can also be a licence for specific software or the place where the node physically resides. A job can run only after all resources required by the job are free. When any resource (most typically the duration of the job) is exceeded, the job is forcefully terminated by PBS.

Note: Every single job requires some resources on the scheduler side. In the case of very short jobs, the planning may take longer than the job itself. Therefore, if you need to submit many (more than a thousand) short (less than 10 minutes) jobs, we strongly recommend running them in batches submitted as one job. To prevent overloading the PBS server, there is a quota of 10 000 jobs (running or queuing) per user.
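A minimal sketch of such batching (the loop bound and the short_task command are hypothetical placeholders):

 #!/bin/bash
 #PBS -l select=1:ncpus=1
 #PBS -l walltime=24:00:00
 # run many short tasks sequentially inside a single PBS job
 for i in $(seq 1 1000); do
     ./short_task "$i"    # hypothetical task taking a few minutes
 done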

Useful links

Computational nodes have properties describing their architecture, operating system, special networking equipment or special access restrictions. Properties are visible on the MetaCentrum web.

You can use the Command qsub refining tool (the qsub assembler) to define conditions. The current state of queues, clusters and jobs is also available in an overview form on the MetaCentrum web in the Current state section.

Types of queues

The current list of queues, including their setup, can be found in the PBSmon application or via the qstat -q or qstat -Q commands. Queues with limited access are marked with a lock icon. Below is a list of the most used queues.

Jobs are sorted into queues according to the mandatorily entered maximal running time of the job (2 h, 4 h, 1 day, 2 days, 4 days, 1 week, 2 weeks and more than 2 weeks).

The queue backfill is designed for "filling" jobs. It has a low priority, its jobs can be killed when needed, and only one machine is allowed per job, but there can be many such jobs.

The queue preemptible allows users to use owner-reserved clusters where no common queues are available. Cluster owners are able to stop other people's running jobs and run their own jobs immediately. Jobs can be suspended for up to 30 days.


PBS basic commands

  • qsub – submit the job to the queue
  • qdel – cancel a waiting or running job
  • qmove – move the job to another queue (only waiting jobs!)
  • pbsnodes – get current node state and its properties
  • qstat – current state of jobs

The volume of displayed information can be adjusted by -v and -vv parameters.

qstat options
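A few commonly used invocations (an illustrative selection only; see man qstat for the complete list of options):

 qstat -u $USER      # jobs of a given user
 qstat -f <job_ID>   # full information about a single job
 qstat -x <job_ID>   # include finished jobs (job history)
 qstat -q            # summary of queues and their limits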

qsub options

The command qsub serves for submitting a job into a queue. The common syntax of the qsub command can be expressed as (-l (lowercase "L") separates the individual resource requests):

qsub [-q queue] -l resource=value [-l resource2=value2] ... [-l resourceN=valueN] script

The switch -q specifies the queue in which the job is enqueued. If the -q option is missing, the queue "default" will be used.

The argument script has to be the name of a file containing a shell script (implicitly a /bin/sh script) that will be interpreted during job execution. Options of the qsub command can be inserted into comments of this script (lines with a hash sign ("#") in the first column). If the script argument is missing, the qsub command will read the script from the standard input.

The above general qsub command can look, for example, like this:

qsub -l select=1:ncpus=1:mem=1gb:scratch_local=10gb -l walltime=1:00:00 script.sh
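The same resource requests can equivalently be placed into the script itself as #PBS directives (a minimal sketch), and the job is then submitted simply with qsub script.sh:

 #!/bin/bash
 #PBS -l select=1:ncpus=1:mem=1gb:scratch_local=10gb
 #PBS -l walltime=1:00:00
 # ... commands of the job follow ...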

How to set number of nodes and processors

The number of processors and "chunks" is set with -l select=[number]:ncpus=[number]. PBS Pro terminology defines a "chunk" as a further indivisible set of resources allocated to a job on one physical node. Chunks can be placed next to each other on one machine, or always on different machines, or placed according to available resources. Note that only one select argument is allowed at a time. Examples:

  • -l select=1:ncpus=2 – two processors on one chunk
  • -l select=2:ncpus=1 – two chunks each with one processor
  • -l select=1:ncpus=1+1:ncpus=2 – two chunks, one with one processor and second with two processors
  • -l select=2:ncpus=1 -l place=pack – all chunks must be on one node (if there is not any big enough node, the job will never run)
  • -l select=2:ncpus=1 -l place=scatter – each chunk will be placed on different node
  • -l select=2:ncpus=1 -l place=free – chunks may be placed on nodes arbitrarily, according to actual resource availability on nodes (default behaviour for PBS Pro)

If you are not sure about the number of needed processors, ask for an exclusive reservation of the whole machine using the parameter "-l place=":

  • -l select=2:ncpus=1 -l place=exclhost – request for 2 exclusive nodes (without cpu and mem limit control)
  • -l select=3:ncpus=1 -l place=scatter:excl – it is possible to combine exclusivity with specification of chunk planning
  • -l select=102 -l place=group=cluster – 102 CPUs on one cluster

How to set the size of scratch

The scratch directory is disk space on the computational node used to store temporary files. Always specify the type and size of scratch; PBS assigns no scratch by default. The scratch type can be one of scratch_local, scratch_ssd or scratch_shared. Examples:

  • -l select=1:ncpus=1:mem=4gb:scratch_local=10gb
  • -l select=1:ncpus=1:mem=4gb:scratch_ssd=1gb
  • -l select=1:ncpus=1:mem=4gb:scratch_shared=1gb

After the scratch request is specified, the following variables are present in the job's environment (a short usage sketch follows the list):

  • $SCRATCH_VOLUME = size of scratch
  • $SCRATCHDIR = path to scratch directory
  • $SCRATCH_TYPE = one of scratch_local, scratch_ssd, scratch_shared
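A typical way of using these variables inside a job script (an illustrative sketch; the file names and the computation command are placeholders):

 # copy input from network storage to the fast local scratch
 cp /storage/brno1/home/$LOGNAME/input.dat $SCRATCHDIR/ || exit 1
 cd $SCRATCHDIR
 ./my_computation input.dat > output.dat       # hypothetical computation
 # copy results back to network storage before the job ends
 cp output.dat /storage/brno1/home/$LOGNAME/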

How to set the memory

Amount of needed memory – a job is implicitly assigned 400 MB of memory if not specified otherwise. Examples:

  • -l select=1:ncpus=1:mem=10gb
  • -l select=1:ncpus=1:mem=200mb

How to set the duration of a job (walltime)

The maximal duration of a job is set by -l walltime=[[hh:]mm:]ss; the default walltime is 24:00:00. Queues q_* (such as q_2h, q_2d etc.) are not accessible for job submission; the routing queue (default) automatically chooses an appropriate time queue based on the specified walltime. Examples:

  • -l walltime=1:00:00 (one hour)
  • -l walltime=120:00:00 (5 days)

How to reserve a licence

Some software requires a licence. A licence is requested via the -l parameter:

  • -l select=3:ncpus=1 -l walltime=1:00:00 -l matlab=1 – one licence for Matlab

How to setup email notification about job state

The PBS server sends an email notification when the job changes its state.

  • -m a send mail when job is aborted by batch system
  • -m b send mail when job begins execution
  • -m e send mail when job ends execution
  • -m n do not send mail

The options a, b, e can be combined, e.g.:

  • -m abe – sends an email when the job aborts (a), begins (b) and completes/ends (e)

The email can be sent to any email address using the -M option:

  • -M james@pbspro.com

How to save the output elsewhere

By default the job output (output and error files) is saved in the folder from which the job was submitted (variable PBS_O_WORKDIR).

This behaviour can be changed for output and error files with the -o and -e parameters, respectively.

  • -o /custom-path/myOutputFile
  • -e /custom-path/myErrorFile

How to choose specific queue or PBS server

If you need to send the job to a specific queue and/or specific PBS server, use the qsub -q destination option.

The argument destination can be one of the following:

queue@server # specific queue on specific server,
queue # specific queue on the current (default) server,
@server # default queue on specific server.

E.g. qsub -q oven@meta-pbs.metacentrum.cz will send the job to the queue oven on the server meta-pbs.metacentrum.cz. Similarly, qsub -q @cerit-pbs.cerit-sc.cz will send the job to the default queue managed by the CERIT PBS server, no matter from which frontend the job is sent.

How to submit a job on nodes with a particular OS

To submit a job to a machine with Debian9, please use "os=debian9" or "os=centos7" in the job specification:

 zuphux$ qsub -l select=1:ncpus=2:mem=1gb:scratch_local=1gb:os=debian9 …

To submit a job to a machine with any Debian*, please use "osfamily=debian" or "osfamily=redhat" (for RHEL or CentOS) in the job specification:

 zuphux$ qsub -l select=1:ncpus=2:mem=1gb:scratch_local=1gb:osfamily=debian …

To run the job on a machine with any OS, use "os=^any":

 zuphux$ qsub -l select=1:ncpus=2:mem=1gb:scratch_local=1gb:os=^any …

If you experience any problems with library or application compatibility in Debian9, please add the debian8-compat module.

How to choose/avoid a particular cluster

PBS allows you to choose a particular cluster:

qsub -l select=X:ncpus=X:mem=Xmb:scratch_local=X:cluster=ida

PBS also allows you to avoid a particular cluster:

qsub -l select=X:ncpus=X:mem=Xmb:scratch_local=X:cluster=^ida

However, it is not possible to combine such conditions. If you want to avoid, e.g., both ida and haldir, the following

qsub -l select=X:ncpus=X:mem=Xmb:scratch_local=X:cluster=^ida:cluster=^haldir

will not work. This is based on the principle that in PBS, every resource (in this case cluster resource) can be specified only once.

On the other hand, cl_ida and cl_haldir are different resources, so:

qsub -l select=X:ncpus=X:mem=Xmb:scratch_local=X:cl_ida=False:cl_haldir=False

will work and will avoid both ida and haldir clusters.

The same can be achieved with

qsub -l select=X:ncpus=X:mem=Xmb:scratch_local=X:cluster=^ida:cl_haldir=False

Note: As the list of node properties is maintained by hand, it is always possible that for some cluster the cl_SOMETHING resource will be missing. Therefore, check in the node properties list that the corresponding cl_* resource exists before relying on it.

Resources of computational nodes

The list of attributes provided here may not be complete. You can find the current list on the web in the Node properties section.

node with a specific feature – the value of a feature must always be specified (either True or False). Examples:

-l select=1:ncpus=1:cluster=tarkil – request for a node from cluster tarkil
-l select=1:ncpus=1:cluster=^tarkil – request for a node except cluster tarkil

request for specific node – always use shortened name. Example:

-l select=1:ncpus=1:vnode=tarkil3 – request for node tarkil3.metacentrum.cz

exclude a specific node – always use shortened name. Example:

-l select=1:ncpus=1:vnode=^elmo3-1 – exclude node elmo3-1.metacentrum.cz

request for a host – use full host name

-l select=1:ncpus=1:host=tarkil3.grid.cesnet.cz

cgroups – request limiting memory usage using cgroups; limiting memory by cgroups is not enabled on all machines. Example:

-l select=1:ncpus=1:mem=5gb:cgroups=memory

cgroups – request limiting CPU usage using cgroups; limiting CPU by cgroups is not enabled on all machines. Example:

-l select=1:ncpus=1:mem=5gb:cgroups=cpuacct

networking cards – "-l place" is also used to request InfiniBand:

-l select=3:ncpus=1 -l walltime=1:00:00 -l place=group=infiniband

CPU flags – limit submission to nodes with specific CPU flags

-l select=cpu_flag=sse3
The list of available flags can be obtained with the command pbsnodes -a | grep resources_available.cpu_flag | awk '{print $3}' | tr ',' '\n' | sort | uniq – this list is updated with every addition or removal of nodes. It is thus wise to check the available flags before you request anything special.

multi-CPU job on the same cluster

qsub -l place=group=cluster

select nodes in a specific location – e.g. to reserve 3 nodes in Pilsen, with 1 processor on each node:

qsub -l select=3:ncpus=1:plzen=True

Moving job to another queue

qmove uv@cerit-pbs.cerit-sc.cz 475337.cerit-pbs.cerit-sc.cz # move job 475337.cerit-pbs.cerit-sc.cz to a queue uv@cerit-pbs.cerit-sc.cz

Jobs can only be moved from one server to another if they are in the 'Q', 'H', or 'W' states, and only if there are no running subjobs. A job in the Running (R), Transiting (T), or Exiting (E) state cannot be moved. See list of queues.

GPU computing

For computing on GPUs, a GPU queue is used (either gpu or gpu_long can be specified). GPU queues are accessible to all MetaCentrum members; one GPU card is assigned by default. The IDs of the assigned GPU cards are stored in the CUDA_VISIBLE_DEVICES variable.

  • -l select=ncpus=1:ngpus=2 -q gpu
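Inside the job, the assigned card IDs can be checked, for example:

 echo $CUDA_VISIBLE_DEVICES   # prints the IDs of the GPU cards assigned by PBS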

CPU speed

To require a minimal CPU speed, use the spec parameter, e.g.

qsub -l select=1:ncpus=1:spec=4.8

will limit your selection to computational nodes with CPU speed scaled as 4.8 (as scaled by SPEC CPU2017) or higher. To see which machines comply with this criterion, go to the qsub assembler and fill in only the spec parameter. Below it you will get a table of machines matching your requirement.

Job Array

  • The job array is submitted as:
 # general command
 $ qsub -J X-Y[:Z] script.sh
 # example
 $ qsub -J 2-7:2 script.sh
  • X is the first index of a job, Y is the upper bound of the index and Z is an optional index step; therefore the example command will generate subjobs with indexes 2, 4, 6.
  • The job array is represented by a single job whose job number is followed by "[]"; this main job provides an overview of unfinished sub jobs.
$ qstat -f 969390'[]' -x | grep array_state_count
    array_state_count = Queued:0 Running:0 Exiting:0 Expired:0 
  • An example of sub job ID is 969390[1].meta-pbs.metacentrum.cz.
  • The sub job can be queried by a qstat command (qstat -t).
  • PBS Pro uses PBS_ARRAY_INDEX instead of Torque's PBS_ARRAYID inside a sub job. The variable PBS_ARRAY_ID contains the job ID of the main job (see the sketch below).
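A sketch of using the index inside the sub job script (the input/output file naming and the program name are hypothetical placeholders):

 #!/bin/bash
 #PBS -J 1-100
 #PBS -l select=1:ncpus=1
 # each sub job processes one input file selected by its array index
 ./my_program input.${PBS_ARRAY_INDEX}.dat > output.${PBS_ARRAY_INDEX}.dat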

Parallelization

  • The number of MPI processes to run in one chunk is specified by mpiprocs=[number].
  • For each MPI process there is one line in the nodefile $PBS_NODEFILE that specifies the allocated vnode.
    • -l select=3:ncpus=2:mpiprocs=2 – 6 MPI processes (the nodefile contains 6 lines with names of vnodes), 2 MPI processes always share 1 vnode with 2 CPUs
  • The number of OpenMP threads to run in one chunk is set with ompthreads=[number]; by default ompthreads = ncpus (i.e. 2 OpenMP threads per chunk in the example above). A launch sketch follows this list.
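A minimal launch sketch (the program name is a placeholder; the exact mpirun options depend on the MPI implementation and module used, and many MPI builds integrated with PBS pick up the nodefile automatically):

 #!/bin/bash
 #PBS -l select=3:ncpus=2:mpiprocs=2
 # 6 MPI processes, one per line of the nodefile
 mpirun -machinefile $PBS_NODEFILE -np 6 ./my_mpi_program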

Default working directory setting in shell

For a shell in tmux – add the following command to the .bashrc in your /home directory:

    case "$-" in *i*) cd /storage/brno1/home/LOGNAME/ ;; esac

For the /home directory:

    case "$-" in *i*) cd ;; esac

Multiprocessor task

Multiprocessor tasks usually use commercial software which is already prepared to use more processors; see the following example of a Gaussian task. Nevertheless, you can create a multiprocessor task on your own. You must specify the queue and the number of processors when submitting the job, for example: a queue for a job which lasts at most 2 hours and 3 machines, with 2 processors on each:

qsub -l walltime=2:0:0 -l select=3:ncpus=2 soubor.sh

Your requests can also be specified in the executed file soubor.sh using lines which start with #PBS, for example:

 #!/bin/sh
 #PBS -N myJobName
 #PBS -l walltime=2:0:0
 #PBS -l select=3:ncpus=2
 #PBS -j oe
 #PBS -m ae
 #
 # description from 'man qsub':
 # -N ... declares a name for the job.  The name specified may be up to and including 15 characters in  length.   It
 #        must consist of printable, non white space characters with the first character alphabetic.
 # -q ... defines the destination of the job (queue)
 # -l ... defines  the  resources that are required by the job
 # -j oe ... standard error stream of the job will be merged with the standard output stream
 # -m ae ...  mail is sent when the job aborts or terminates
 

and then the job can be submitted by typing

qsub soubor.sh

After that, the script is executed on one of the allocated machines. The list of allocated machines can be found in the file whose name is stored in the system variable $PBS_NODEFILE. Other processes can be executed on the other allocated machines with the pbsdsh command. (For more details see the man pages: man qsub and man pbsdsh.) For example:

 #!/bin/sh
 #PBS -l select=3:ncpus=2
 #PBS -j oe
 #PBS -m e
 #
 # the file whose name is stored in the variable PBS_NODEFILE contains the list of allocated machines
 echo '***PBS_NODEFILE***START*******'
 cat $PBS_NODEFILE
 echo '***PBS_NODEFILE***END*********'
 
 # run the given command on all allocated machines
 pbsdsh -- uname -a
 

Submit the job, wait until it is complete (status C) and check the output:

$ qsub file.sh
272154.skirit-f.ics.muni.cz

$ qstat 272154.skirit-f.ics.muni.cz
 Job id           Name             User             Time Use S Queue
 ---------------- ---------------- ----------------  -------- - -----
 272154.skirit-f   file.sh         makub            00:00:00 R short
 
 $ qstat 272154.skirit-f.ics.muni.cz
 Job id           Name             User             Time Use S Queue
 ---------------- ---------------- ----------------  -------- - -----
 272154.skirit-f   file.sh         makub            00:00:00 C short
 
 $ cat soubor.sh.o272154
 ***PBS_NODEFILE***START*******
 skirit4.ics.muni.cz
 skirit7.ics.muni.cz
 skirit8.ics.muni.cz
 skirit4.ics.muni.cz
 skirit7.ics.muni.cz
 skirit8.ics.muni.cz
 ***PBS_NODEFILE***END*********
 Linux skirit4.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
 Linux skirit8.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
 Linux skirit4.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
 Linux skirit7.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
 Linux skirit8.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
 Linux skirit7.ics.muni.cz 2.6.17.7 #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
 

Observe that the request select=3:ncpus=2 was handled by allocating the machines skirit4, skirit7 and skirit8. Each of them appears twice because two processors were requested on each node.

How to avoid the migration of jobs between PBS servers

Submitted jobs may be automatically moved to a different PBS server. This typically occurs if one PBS server is overloaded (and new jobs are queueing) while another PBS server has free resources. In some cases this automatic transfer may be undesirable (e.g. due to differences in Debian versions) and should be prevented; users can prevent it by specifying a resource. Limiting a job to a specific PBS server is done using the pbs_server resource.

pbs_server=<full_pbs_server_name>

The following example, qsub -l select=ncpus=1:pbs_server=meta-pbs.metacentrum.cz ..., will limit the job to nodes managed by the MetaCentrum PBS server.