About scheduling system

From MetaCentrum

The batch job system PBS (Portable Batch System) allows running interactive and non-interactive jobs with different requirements in such a way that shared resources are used rationally and fairly. It prevents the monopolization of resources by specific jobs or specific users. This document describes PBS usage in the MetaCentrum environment.

Basic concepts - jobs, queues, resources

The basic term of the PBS system from the user's point of view is a computational job. A job can be batch (does not require a terminal; the user does not monitor it) or interactive, and single- or multi-processor.

Jobs are submitted into so-called queues on the PBS server using the qsub command, where they wait for conditions suitable for their run, especially with regard to system load. The user specifies the job's resource requirements and target queue, and PBS then runs jobs according to the current state of the computational nodes.

Computational nodes have properties describing their architecture, operating system, special networking equipment or special access restrictions. Properties can be listed using the pbsnodes program; they are also visible on the MetaCentrum web and are described in more detail below.

Jobs wait for execution in queues, which can have different priorities for job execution. Freely available queues differ in the maximal allowed duration of a job run. Other queues are dedicated to special projects or user groups (ncbr, iti, gridlab, ...) or to special types of jobs (gpu, ...). Some queues are available only to listed users or groups. Queues can also have limits on the maximum number of running jobs, and so on.

Available queues

The current list of queues, including their settings, can be found in the Pbsmon application or via the qstat -q or qstat -Q commands. Queues with limited access are marked with a lock. Below is a list of the most used queues.

Jobs are sorted into queues according to the mandatorily entered maximal running time of the job (2 h, 4 h, 1 day, 2 days, 4 days, 1 week, 2 weeks and more than 2 weeks).

The backfill queue is designed for "filler" jobs. It has low priority, its jobs can be killed when needed, and only one machine is allowed per computation, but there can be many such jobs.

The preemptible queue allows users to use owner-reserved clusters where no common queues are available. Cluster owners are able to stop other people's running jobs and run their own jobs immediately. Jobs can be suspended for up to 30 days.

Note: Every single job requires some resources on the scheduler side. In the case of very short jobs, the planning may take longer than the job itself. Therefore, if you need to submit many (more than a thousand) short (less than 10 minutes) jobs, we strongly recommend running them in batches submitted as one job. To prevent overloading the PBS server, there is a quota of 10 000 jobs (running or queued) per user.
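A minimal sketch of such batching: the script below loops over all the short tasks inside one submitted job instead of submitting each as its own job. The run_task function is a hypothetical placeholder for your real short computation.

```shell
#!/bin/bash
# Sketch: wrap many short tasks into one batch job instead of submitting
# thousands of separate jobs. run_task is a placeholder for a real task.
run_task() {
    echo "task $1 done"
}

# one submitted job, many short tasks executed sequentially
for i in $(seq 1 5); do
    run_task "$i"
done
```

Submitting this script once with qsub costs the scheduler one job, whereas submitting each task separately would cost thousands.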

Basic commands

Jobs are submitted using the qsub command. The command returns a job identifier that is used to track and manipulate the job during its lifecycle. The qsub command also checks whether the user has valid Kerberos tickets (and PBS guarantees that the job will have the tickets during its execution). Program output is saved in the folder from which the job was submitted; the output location can be changed with the -o and -e options of qsub. It is useful to specify requirements for computational resources during job submission, otherwise the default values (1 computing node, 1 CPU, 400 MB of memory, 24-hour time limit) are used. When the time limit is exceeded, the job is terminated.

  • qsub – submits a job to a queue
  • qdel – cancels a waiting or running job
  • qmove – moves a job to another queue
  • pbsnodes – current node state and its properties
  • qstat or qstat -f – current state of jobs
  • qfree – overview of free PBS resources
  • The volume of displayed information can be adjusted with the -v and -vv parameters.

The current state of queues, clusters and jobs is also conveniently available on the MetaCentrum web in the Current state section.

qsub options

The qsub command serves for submitting a job into a queue. The general syntax of qsub can be expressed as follows (each -l, a lowercase "L", introduces one resource request):

qsub [-q queue] -l resource=value [-l resource2=value2] ... [-l resourceN=valueN] script

The -q switch specifies the queue in which the job is enqueued. If the -q option is missing, the default routing queue is used and the job is placed in the first suitable queue (most often long/normal/short, according to the entered duration of the job).

The script argument has to be the name of a file containing a shell script (implicitly a /bin/sh script) that will be interpreted during job execution. Options of the qsub command can be embedded in comments of this script (lines with a hash sign ("#") in the first column). If the script argument is missing, qsub reads the script from standard input.

The above general qsub command can look, for example, like this:

qsub -l select=1:ncpus=1:mem=1gb:scratch_local=10gb -l walltime=1:00:00 script.sh

Number of machines and processors – the number of processors and "chunks" is set with -l select=[number]:ncpus=[number]. PBS Pro terminology defines a "chunk" as a further indivisible set of resources allocated to a job on one physical node; a job with several chunks is the analogy of a multi-node job in the old TORQUE system. Chunks can be placed next to each other on one machine, always on different machines, or freely according to available resources. Note that only one select argument is allowed at a time. Examples:

  • -l select=1:ncpus=2 – two processors on one chunk
  • -l select=2:ncpus=1 – two chunks each with one processor
  • -l select=1:ncpus=1+1:ncpus=2 – two chunks, one with one processor and the second with two processors
  • -l select=2:ncpus=1 -l place=pack – all chunks must be on one node (if there is no node big enough, the job will never run)
  • -l select=2:ncpus=1 -l place=scatter – each chunk will be placed on a different node (default in the old TORQUE system)
  • -l select=2:ncpus=1 -l place=free – chunks may be placed on nodes arbitrarily, according to actual resource availability (chunks can be on one or more nodes; default behaviour in PBS Pro)
If you are not sure about the number of needed processors, ask for an exclusive reservation of the whole machine using the "-l place=" parameter:
  • -l select=2:ncpus=1 -l place=exclhost – request for 2 exclusive nodes (without CPU and memory limit control)
  • -l select=3:ncpus=1 -l place=scatter:excl – exclusivity can be combined with the specification of chunk placement
  • -l select=102:ncpus=1 -l place=group=cluster – 102 CPUs on one cluster

Amount of temporary scratch – fast and reliable disk space for temporary files. Always specify the type and size of scratch; a job has no default scratch assigned. The scratch type can be one of scratch_local, scratch_ssd or scratch_shared. Examples:

  • -l select=1:ncpus=1:mem=4gb:scratch_local=10gb
  • -l select=1:ncpus=1:mem=4gb:scratch_ssd=1gb
  • -l select=1:ncpus=1:mem=4gb:scratch_shared=1gb
After scratch has been requested, the following variables are present in the job environment: $SCRATCH_VOLUME=<dedicated capacity>, $SCRATCHDIR=<directory>, $SCRATCH_TYPE=<scratch_local|scratch_ssd|scratch_shared>
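A minimal job-script sketch using these variables to work inside the assigned scratch directory; the mktemp fallback is only there so the sketch runs outside PBS, where $SCRATCHDIR would not be set.

```shell
#!/bin/bash
# Sketch: compute inside the scratch directory assigned by PBS.
# Under PBS, SCRATCHDIR is set once scratch was requested; the fallback
# below exists only so this sketch is self-contained.
SCRATCHDIR="${SCRATCHDIR:-$(mktemp -d)}"
test -d "$SCRATCHDIR" || { echo "no scratch assigned" >&2; exit 1; }

cd "$SCRATCHDIR"
echo "computing in $SCRATCHDIR"
# ... copy inputs in, compute, copy results back to /storage ...

cd / && rm -rf "$SCRATCHDIR"   # clean the scratch before the job ends
```

Cleaning scratch at the end matters: scratch space is shared and leftover data reduces what other jobs can request.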

Amount of needed memory – a job is implicitly assigned 400 MB of memory unless specified otherwise. Examples:

  • -l select=1:ncpus=1:mem=1gb
  • -l select=1:ncpus=1:mem=10gb
  • -l select=1:ncpus=1:mem=200mb

Maximal duration of a job – set by -l walltime=[[hh:]mm:]ss; the default walltime is 24:00:00. The q_* queues (such as q_2h, q_2d etc.) are not accessible for direct submission; the default routing queue automatically chooses the appropriate time queue based on the specified walltime. Examples:

  • -l walltime=1:00:00 (one hour)
  • -l walltime=24:00:00 (one day)
  • -l walltime=120:00:00 (5 days)

Licences – requested with the -l parameter. Example:

  • -l select=3:ncpus=1 -l walltime=1:00:00 -l matlab=1 – one licence for Matlab

Sending the information emails about the job state

  • -m abe – sends an email when the job aborts (a), begins (b) and completes/ends (e)

You can use the Qsub refining tool on the MetaCentrum web to assemble these options.

Advanced options

How to choose specific queue or PBS server

If you need to send the job to a specific queue and/or specific PBS server, use the qsub -q destination option.

The argument destination can be one of the following:

queue@server # specific queue on specific server,
queue # specific queue on the current (default) server,
@server # default queue on specific server.

E.g., qsub -q oven@meta-pbs.metacentrum.cz sends the job to the queue oven on the server meta-pbs.metacentrum.cz. Similarly, qsub -q @cerit-pbs.cerit-sc.cz sends the job to the default queue managed by the CERIT PBS server, no matter which frontend the job is submitted from.

How to submit a job on special nodes with a particular OS

To submit a job to a machine with Debian 9, use "os=debian9" (or "os=centos7") in the job specification:

 zuphux$ qsub -l select=1:ncpus=2:mem=1gb:scratch_local=1gb:os=debian9 …

To submit a job to a machine with any Debian version, use "osfamily=debian" (or "osfamily=redhat" for RHEL or CentOS) in the job specification:

 zuphux$ qsub -l select=1:ncpus=2:mem=1gb:scratch_local=1gb:osfamily=debian …

To run a job on a machine with any OS, use "os=^any":

 zuphux$ qsub -l select=1:ncpus=2:mem=1gb:scratch_local=1gb:os=^any …

If you experience any compatibility problems with libraries or applications on Debian 9, add the module debian8-compat.

Resources of computational nodes

The provided list of attributes may not be complete. You can find the actual list on the web in the Node properties section.

node with a specific feature – the value of a feature must always be specified (either True or False). Examples:

-l select=1:ncpus=1:cluster=tarkil – request for a node from cluster tarkil
-l select=1:ncpus=1:cluster=^tarkil – request for a node except cluster tarkil

request for a specific node – always use the shortened name. Example:

-l select=1:ncpus=1:vnode=tarkil3 – request for node tarkil3.metacentrum.cz

request for a specific host – use the full host name. Example:

-l select=1:ncpus=1:host=tarkil3.grid.cesnet.cz

cgroups – request memory usage limiting via cgroups; limiting memory by cgroups is not enabled on all machines. Example:

-l select=1:ncpus=1:mem=5gb:cgroups=memory

cgroups – request CPU usage limiting via cgroups; limiting CPU by cgroups is not enabled on all machines. Example:

-l select=1:ncpus=1:mem=5gb:cgroups=cpuacct

networking cards – "-l place" is also used to request InfiniBand:

-l select=3:ncpus=1 -l walltime=1:00:00 -l place=group=infiniband

CPU flags – limit submission to nodes with specific CPU flags. Example:

-l select=1:ncpus=1:cpu_flag=sse3
The list of available flags can be obtained with the command pbsnodes -a | grep resources_available.cpu_flag | awk '{print $3}' | tr ',' '\n' | sort | uniq – this list changes whenever nodes are added or removed, so it is wise to check the available flags before requesting anything special.

multi-CPU job on the same cluster

qsub -l place=group=cluster

selecting nodes in a specific location – e.g. to reserve 3 nodes in Pilsen, 1 processor on each node:

qsub -l select=3:ncpus=1:plzen=True

Moving job to another queue

qmove uv@cerit-pbs.cerit-sc.cz 475337.cerit-pbs.cerit-sc.cz # move job 475337.cerit-pbs.cerit-sc.cz to a queue uv@cerit-pbs.cerit-sc.cz

Jobs can be moved from one server to another only if they are in the 'Q', 'H' or 'W' state and have no running subjobs. A job in the Running (R), Transiting (T) or Exiting (E) state cannot be moved. See the list of queues.

GPU computing

For computing on GPUs the gpu queue is used (either gpu or gpu_long can be specified). GPU queues are accessible to all MetaCentrum members; one GPU card is assigned by default. The IDs of the assigned GPU cards are stored in the CUDA_VISIBLE_DEVICES variable.

  • -l select=1:ncpus=1:ngpus=2 -q gpu
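A small sketch of how a job script might inspect the assigned cards via CUDA_VISIBLE_DEVICES; the default value 0,1 below is a simulated assignment so the sketch runs standalone, not something PBS guarantees.

```shell
#!/bin/bash
# Sketch: read the GPU card IDs assigned by PBS. Under PBS the variable is
# already set; the default here only simulates an assignment of two cards.
CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1}"

# count the comma-separated card IDs
NGPUS=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l)
echo "assigned $NGPUS GPU card(s): $CUDA_VISIBLE_DEVICES"
```

CUDA applications started from the job see only the cards listed in this variable, so no manual device selection is usually needed.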

Job Array

  • The job array is submitted as:
 # general command
 $ qsub -J X-Y[:Z] script.sh
 # example
 $ qsub -J 2-7:2 script.sh
  • X is the first index of a job, Y is the upper bound of the index and Z is an optional index step; the example command therefore generates subjobs with indexes 2, 4 and 6.
  • The job array is represented by a single job whose job number is followed by "[]"; this main job provides an overview of unfinished subjobs.
$ qstat -f 969390'[]' -x | grep array_state_count
    array_state_count = Queued:0 Running:0 Exiting:0 Expired:0 
  • An example of sub job ID is 969390[1].meta-pbs.metacentrum.cz.
  • The sub job can be queried by a qstat command (qstat -t).
  • PBS Pro uses PBS_ARRAY_INDEX inside a subjob instead of Torque's PBS_ARRAYID. The variable PBS_ARRAY_ID contains the job ID of the main job.
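A minimal sketch of a job-array script that picks its input file by index. The default index and the input_*.txt naming scheme are illustrative assumptions; under PBS Pro, PBS_ARRAY_INDEX is already set in each subjob.

```shell
#!/bin/bash
# Sketch: a job-array subjob selecting its input by PBS_ARRAY_INDEX.
# The default value exists only so the sketch runs outside PBS.
PBS_ARRAY_INDEX="${PBS_ARRAY_INDEX:-2}"

# hypothetical naming scheme: one input file per subjob index
INPUT="input_${PBS_ARRAY_INDEX}.txt"
echo "subjob $PBS_ARRAY_INDEX would process $INPUT"
```

Submitted as qsub -J 2-7:2 script.sh, the subjobs with indexes 2, 4 and 6 would each pick their own input file this way.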

MPI processes

  • The number of MPI processes that run on one chunk is specified by mpiprocs=[number].
  • For each MPI process there is one line in nodefile $PBS_NODEFILE that specifies allocated vnode.
    • -l select=3:ncpus=2:mpiprocs=2 – 6 MPI processes (nodefile contains 6 lines with names of vnodes), 2 MPI processes always share 1 vnode with 2 CPU
  • The number of OpenMP threads running in one chunk is set by ompthreads=[number]; by default ompthreads = ncpus (e.g. 2 OMP threads on a chunk with ncpus=2).
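A sketch of how a job script might derive the MPI process count from the nodefile. The fabricated nodefile contents below (three vnodes, two lines each, matching -l select=3:ncpus=2:mpiprocs=2) only stand in for what PBS would provide, and the mpirun line is an illustrative launcher, not a prescribed one.

```shell
#!/bin/bash
# Sketch: count MPI slots from the nodefile. Under PBS, $PBS_NODEFILE names
# a file with one line per MPI process; we fabricate it here so the sketch
# is self-contained.
PBS_NODEFILE="${PBS_NODEFILE:-$(mktemp)}"
if [ ! -s "$PBS_NODEFILE" ]; then
    printf 'node1\nnode1\nnode2\nnode2\nnode3\nnode3\n' > "$PBS_NODEFILE"
fi

NPROCS=$(wc -l < "$PBS_NODEFILE")
echo "launching $NPROCS MPI processes"
# e.g.: mpirun -np "$NPROCS" -machinefile "$PBS_NODEFILE" ./my_mpi_app
```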

Default working directory setting in shell

for a shell in tmux – add the following command to the .bashrc in your /home directory:

    case "$-" in *i*) cd /storage/brno1/home/$LOGNAME/ ;; esac

for the /home directory:

    case "$-" in *i*) cd ;; esac

Multiprocessor task

Multiprocessor tasks usually use commercial software which is already prepared to use more processors; see the following example of a Gaussian task. Nevertheless, you can also create a multiprocessor task on your own. You must specify the queue and the number of processors when submitting the job, for example: a queue for a job lasting at most 2 hours, with 3 machines and 2 processors on each:

qsub -l walltime=2:0:0 -l select=3:ncpus=2 file.sh

Your requests can also be specified in the executed file file.sh on lines starting with #PBS, for example:

 #PBS -N myJobName
 #PBS -l walltime=2:0:0
 #PBS -l select=3:ncpus=2
 #PBS -j oe
 #PBS -m ae
 # description from 'man qsub':
 # -N ... declares a name for the job.  The name specified may be up to and including 15 characters in  length.   It
 #        must consist of printable, non white space characters with the first character alphabetic.
 # -q ... defines the destination of the job (queue)
 # -l ... defines  the  resources that are required by the job
 # -j oe ... standard error stream of the job will be merged with the standard output stream
 # -m ae ...  mail is sent when the job aborts or terminates

and then job can be executed by typing

qsub file.sh

After that, the script is executed on one of the allocated machines. The list of allocated machines can be found in the file whose name is stored in the system variable $PBS_NODEFILE. Other processes can be started on the other allocated machines with the pbsdsh command. (For more details see the man pages: man qsub and man pbsdsh.) For example:

 #PBS -l select=3:ncpus=2
 #PBS -j oe
 #PBS -m e
 # the file whose name is stored in the variable PBS_NODEFILE contains the list of allocated machines
 echo '***PBS_NODEFILE***START*******'
 cat "$PBS_NODEFILE"
 echo '***PBS_NODEFILE***END*********'
 # run the given command on all allocated machines
 pbsdsh -- uname -a

Submit the job, wait until it completes (state C) and check the output:

$ qsub file.sh

$ qstat 272154.skirit-f.ics.muni.cz
 Job id           Name             User             Time Use S Queue
 ---------------- ---------------- ----------------  -------- - -----
 272154.skirit-f   file.sh         makub            00:00:00 R short
 $ qstat 272154.skirit-f.ics.muni.cz
 Job id           Name             User             Time Use S Queue
 ---------------- ---------------- ----------------  -------- - -----
 272154.skirit-f   file.sh         makub            00:00:00 C short
 $ cat file.sh.o272154
 Linux skirit4.ics.muni.cz #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
 Linux skirit8.ics.muni.cz #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
 Linux skirit4.ics.muni.cz #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
 Linux skirit7.ics.muni.cz #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
 Linux skirit8.ics.muni.cz #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux
 Linux skirit7.ics.muni.cz #1 SMP Thu Aug 3 11:16:56 CEST 2006 i686 GNU/Linux

Observe that the request select=3:ncpus=2 was satisfied by allocating the machines skirit4, skirit7 and skirit8. Each of them appears twice because two processors were requested on each node.