Beginners guide

Grid computing: basic idea

Related topics
Grids and supercomputers
Scheduling system
Resources and queues

MetaCentrum offers resources for so-called grid computing. Roughly speaking, a grid is a network of many interconnected computers whose properties (type and size of disk, RAM, CPU, GPU etc.) may differ and which may be located in different places (cities, institutes).

The scheduling system keeps track of the grid's resources (memory, CPU time, disk space) and keeps computational jobs waiting in queues until enough resources are free for them to run. Users prepare and submit their jobs on so-called frontends, machines reserved for user activity. The rest of the grid's machines, the computational nodes, do the computation itself.

Log on to a frontend machine

Related topics
Usage of PuTTY
Other ways to access the grid from Linux
Other ways to access the grid from Windows
Remote desktop: graphical access
Kerberos: single sign-on system
Kerberos on Linux
Kerberos on Windows

Accessing the grid means logging on to one of the frontends. All frontend machines run Linux. Linux users only need to open a terminal. Windows users will need to install PuTTY, which enables them to open a Linux terminal from a Windows PC.

Open the terminal and type the following command. Note: the text after "#" is only a comment; do not copy it.

ssh # user "jenicek" wants to log on frontend

You may substitute any other frontend; check the list of all available frontends.

Note: Depending on the chosen frontend, you may see different content in your /home directory. To find out more about the infrastructure, read Frontend#Home_directory or Working_with_data#Disk_arrays.

Do not run resource-demanding operations, such as computing or large-scale compiling and archiving, on the frontends, as they negatively affect all other logged-in users. Such operations must be submitted as an interactive job.

When you log in for the first time, you will probably be prompted with a query similar to the following:

The authenticity of host ' (2001:718:ff01:1:216:3eff:fe20:382)' can't be established. ECDSA key fingerprint is SHA256:Splg9bGTNCeVSLE0E4tB30pcLS80sWuv0ezHrH1p0xE. Are you sure you want to continue connecting (yes/no)?

Type "yes" and hit Enter. You will then be prompted for your MetaCentrum password; type it and hit Enter. A MetaCentrum welcome logo and some information about your last login, home directories etc. will appear, with a line similar to the following at the very bottom.

jenicek@skirit:~$ # user "jenicek" is now logged on the frontend "skirit", the blinking cursor waits for the command to be typed

There are more tools to access the grid. For those who want to use a single sign-on method of access, there is the Kerberos authentication system. For a full list of articles on access, see this part of Guidepost.

Know Linux command line

All frontend machines run Linux, so some knowledge of the Linux CLI (command line interface) is needed. As there are a number of sites covering the Linux CLI basics, we will not give an overview here; search the Internet for keywords like basic Linux commands, Linux CLI, Linux CLI for newbies etc. All code snippets and examples given in this guide are commented and explained.

Prepare input file(s)

Related topics
Windows EOL problem

Suppose there is a file called on your PC that you want to use as input for a calculation run on the grid. To transfer it to a frontend, open a terminal and type:

scp # this command copies file from machine to machine

Windows users can use the same command via PuTTY or WinSCP tool.

Windows marks end-of-lines (EOLs) in a text file differently than Linux. This may not be a problem in input files, but a script (a text file containing a series of commands) written on a Windows PC will crash on Linux, producing the following error message:

./ # user jenicek wants to run a script "" with Windows EOLs
-bash: ^M: command not found

To make the line endings compatible with Linux, Windows users first need to run their scripts through the dos2unix command.

dos2unix # convert script "" from Windows EOLs to Linux EOLs
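Whether a file still carries Windows line endings can also be checked with standard tools. The following is a minimal sketch, not part of the MetaCentrum tooling: the helper names has_cr and strip_cr are made up for this illustration, the demo script is generated on the fly, and tr is used here as a generic alternative to dos2unix.

```shell
# Create a tiny script with Windows (CRLF) line endings for the demonstration.
demo_script=$(mktemp)
printf 'echo hello\r\necho world\r\n' > "$demo_script"

# has_cr succeeds when the file contains at least one carriage return (^M).
has_cr() { grep -q "$(printf '\r')" "$1"; }

# strip_cr writes a copy of file $1 without carriage returns into file $2.
strip_cr() { tr -d '\r' < "$1" > "$2"; }

fixed_script=$(mktemp)
has_cr "$demo_script" && echo "Windows line endings found"
strip_cr "$demo_script" "$fixed_script"
has_cr "$fixed_script" || echo "converted copy is clean"

# When done, remove the temporary files: rm "$demo_script" "$fixed_script"
```

Writing the converted text to a second file keeps the original untouched in case something goes wrong.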

Note 1: Linux is case sensitive, so files "", "" and "" are treated as three different files. The same holds for commands. For example, the command ls lists the content of a directory, whereas Ls or LS will return a "command not found" error.

Note 2: When naming files and directories, do not use special characters like asterisks, semicolons, ampersands etc. Using a space (e.g. "my file") is possible, but not recommended; the shell normally interprets a space as a separator between arguments, so it must be escaped with a backslash (preceded by the "\" character), or the whole name must be quoted, if you want to use it literally.

jenicek@skirit:~$ ls my file # this command will look for 2 files called "my" and "file", respectively, in the current directory
jenicek@skirit:~$ ls my\ file # this command will look for a file called "my file" in the current directory
jenicek@skirit:~$ ls "my file" # equivalent to the line above
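How the shell splits such names into arguments can be checked without touching any files. A minimal sketch (count_args is a helper made up for this illustration):

```shell
# count_args prints how many separate arguments the shell passed to it.
count_args() { echo "$#"; }

count_args my file      # unquoted: the shell passes 2 arguments
count_args my\ file     # escaped space: 1 argument
count_args "my file"    # double quotes: 1 argument
count_args 'my file'    # single quotes: 1 argument

# Inside double quotes a backslash before a space is kept literally,
# so this prints the name with the backslash still in it.
printf '%s\n' "my\ file"
```

In other words, use either the backslash or the quotes, not both together.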

Text editors on frontend

To write or modify a file, you can always copy it to your PC, make the changes in your favourite text editor, then copy it back to the frontend. This may, however, become rather time-consuming. It may be advantageous to learn how to open and modify a file directly on the frontend.

All frontends have two Linux text editors: vi and pico.

jenicek@skirit:~$ vi # this command will open file "" in vi text editor
jenicek@skirit:~$ pico # this command will open file "" in pico text editor

Both editors can seem a bit user-unfriendly at first, although they are immensely powerful and effective for text editing. If you are going to modify files often, it is advisable to learn to use one of them. Numerous resources can be found on the Internet; search for keywords like vi editor tutorial, vi commands quick reference etc.

Midnight Commander

You can also view and manipulate files using Midnight Commander, a visual file manager similar to Norton Commander.

jenicek@skirit:~$ mc # open Midnight Commander

Choose your applications

Related topics
List of applications (sorted alphabetically)
List of applications (sorted by topic)
Application modules
How to install an application

A wide range of scientific software is installed on MetaCentrum machines, from mathematical and statistical software through computational chemistry and bioinformatics to technical and material modelling software.

You can load an application offered by MetaCentrum into your job or onto your machine via the command module add followed by the name of the selected application. If you are not sure which version of the application you would like to use, check the complete list of applications page first.

For example:

jenicek@skirit:~$ module avail # shows all currently available applications
jenicek@skirit:~$ module avail 2>&1 | grep g16 # show all modules containing "g16" in their name
jenicek@skirit:~$ module add g16-B.01 # loads Gaussian 16, v. B.01 application
jenicek@skirit:~$ module list # shows currently loaded applications in your environment
jenicek@skirit:~$ module unload g16-B.01 # unloads Gaussian 16, v. B.01 application
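The 2>&1 in the second command above matters: the module command typically prints its listing to standard error, which a plain pipe does not pass on. A minimal sketch of the effect, using a hypothetical stand-in function fake_avail (the name example-tool-1.0 is made up) in place of the real module avail:

```shell
# fake_avail is a hypothetical stand-in for "module avail": like the real
# command is assumed to do, it writes its listing to standard error.
fake_avail() {
    printf 'g03\ng16-B.01\nexample-tool-1.0\n' >&2
}

# Without 2>&1 the pipe carries only standard output, so grep sees nothing.
# (2>/dev/null merely hides the listing that would otherwise hit the screen.)
fake_avail 2>/dev/null | grep g16 || echo "no match without 2>&1"

# With 2>&1 standard error is merged into the pipe and grep can filter it.
fake_avail 2>&1 | grep g16
```

The same redirection is useful whenever the output of module avail or module list is to be filtered or saved to a file.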

Users can install their own software. If you would like to install a new application or a new version of an application, read How to install an application or contact User support.

Specify required resources

Related topics
PBS options detailed guide
How to choose a particular queue or PBS server
PBS reference card [PDF]

Running a calculation on the grid is not as straightforward as running it on your local computer. You cannot, for example, do this:

jenicek@skirit:~$ g16 <adenine_in_water.inp # run Gaussian16 calculation with adenine_in_water.inp input file straightaway


jenicek@skirit:/cmake/my_code/build$ make # compile some VERY large code

Note: Remember! Running calculations on a frontend is prohibited.

The correct way is to send the computational job to PBS (Portable Batch System). PBS keeps track of the computational resources across the whole grid and runs the job only after enough resources have been freed.

PBS needs an estimate of how many resources (number of CPUs, time and memory) a job will need and where the temporary files should be located.

Note: It is up to the user to provide PBS with an educated guess of the resources.

The information about resources goes into the qsub command options.

Request time, memory, and number of CPUs

In the qsub command, resource requests follow the -l option (lowercase "L"), individual resources are separated by colons (:), and each resource is given as a <resource>=<value> pair.

qsub -l select=1:ncpus=2:mem=4gb:scratch_local=1gb -l walltime=2:00:00


  • ncpus is the number of processors (2 in this example),
  • mem is the amount of memory that will be reserved for the job (4 GB in this example, default 400 MB),
  • scratch_local specifies the size and type of the scratch directory (1 GB in this example, no default),
  • walltime is the maximum time the job will run, given in the format hh:mm:ss (2 hours in this example, default 24 hours).
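To avoid typing mistakes in the resource list, the request can be assembled from variables and inspected before it is submitted. This is only a sketch: build_qsub is a helper name made up for this illustration, the values are the ones from the example above, and select=1: (one chunk) is written out explicitly as in the other examples in this guide.

```shell
# Example values; adjust them to what your job really needs.
ncpus=2
mem=4gb
scratch=1gb
walltime=2:00:00

# build_qsub prints the qsub command line instead of running it,
# so the request can be checked before anything is submitted.
build_qsub() {
    echo "qsub -l select=1:ncpus=$1:mem=$2:scratch_local=$3 -l walltime=$4"
}

build_qsub "$ncpus" "$mem" "$scratch" "$walltime"
```

Once the printed line looks right, it can be copied to the command line and completed with the name of the batch script or with the -I option.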

The qsub command has many more options than the ones shown here; for example, it is possible to specify the number of computational nodes, the type of their OS or their physical placement. More in-depth information about PBS commands can be found on the page About scheduling system.

Specify scratch directory

Most applications produce some temporary files during the calculation. The scratch directory is the disk space where these temporary files are stored.

Note: There is no default scratch directory and the user must always specify its type.

Related topics
Types of scratch

There are three possible choices for a scratch:

Local scratch
  • located locally on the node where the calculation is done
  • temporary files are saved in the /scratch/username/job_ID/ directory
  • reasonably fast, available on all machines
  • choose this option if you have no special requirements for scratch

Shared scratch
  • network volume available to all the nodes of a specific cluster
  • mounted to the directory /scratch.shared, which is shared between all clusters in a given location (city)
  • read/write operations are slower than on local scratch
  • useful if you need to run more than one application that needs access to the same data

SSD scratch
  • with this option, the scratch will be located on one of the SSD disks
  • SSD is a type of disk with faster read/write operations than a "normal" hard drive, but with smaller capacity
  • useful if your application creates or reads a lot of files
  • the number of nodes with SSD disks is limited, so choosing this option may result in a longer queuing time

After the job is finished, the scratch directory and all its content will be erased. Depending on how much free space there is on the computational node this may happen immediately or after a few days.

Send job to a specific queue

Sometimes you will need to send a job to a specific queue (see the list of queues on PBS servers). For example:

qsub -q # send job to queue uv on PBS server

Batch or interactive job?

Related topics
How to get graphical interface for interactive jobs
Tutorial on using the trap command in batch script

A job can be run in two ways: either by creating a batch script or interactively. Both approaches do essentially the same thing; however, depending on circumstances, one may be preferable to the other.


Note: Running an interactive job is not the same as running it straightaway on a frontend. Here, too, the user first requests resources, waits until they are granted, and only then can start working.

Run interactive job

An interactive job is requested via the qsub command with the -I (uppercase "i") option.

jenicek@skirit:~$ qsub -I -l select=1:ncpus=2:mem=4gb:scratch_local=10gb -l walltime=1:00:00

The command in this example submits an interactive job that will run on one machine with 2 processors, use up to 4 GB of RAM and 10 GB of local scratch, and last at most 1 hour. After hitting Enter, a line similar to the following will appear:

qsub: waiting for job to start

After some time, 2 lines similar to the following will appear.

qsub: job ready # in this example, 11681412 is the ID of interactive job
jenicek@elmo5-26:~$ # note that user "jenicek" has been moved from a frontend (skirit) to a computational node (in this case elmo5-26)

Now you can run calculations, compile code or archive files with tar on the command line, e.g.

jenicek@elmo5-26:~$ module load g16-B.01 # load Gaussian 16
jenicek@elmo5-26:~$ g16 < >h2o.out   # run Gaussian calculation with "" input file, output will be in "h2o.out" file

Unless you log out earlier, after 1 hour you will get the following message:

jenicek@elmo5-26:~$ =>> PBS: job killed: walltime 3630 exceeded limit 3600

qsub: job completed

This means the PBS scheduling system reported that a resource limit was exceeded (in this case time) and therefore terminated the job.
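The numbers in the message are seconds: the requested walltime of 1:00:00 corresponds to the limit of 3600. A small sketch of the conversion, handy when comparing a requested walltime with the values PBS reports (to_seconds is a helper name made up for this illustration and expects the full hh:mm:ss form):

```shell
# to_seconds converts a walltime given as hh:mm:ss into seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

to_seconds 1:00:00   # prints 3600
```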

Advantages of interactive jobs

  • Better overview of what you are doing and more flexibility (good for compiling, archiving etc.)
  • Probably easier to learn (no need to write a batch script and risk making a mistake)
  • A graphical user interface is available for interactive jobs

Disadvantages of interactive jobs

  • Utilization of computational nodes is not optimal (waiting for user's input)
  • Not suitable to handle long-running jobs (closing a terminal or failed Internet connection will terminate the job!)
  • Difficult to handle jobs working with multiple files

Run batch jobs

In the case of a batch job, all the information is packed in a script, a short piece of code. The script is then submitted via the qsub command and no further care needs to be taken. In batch jobs, the PBS options can be specified either on the command line or inside the script. Both ways are equally correct; choose whichever you prefer.

Specifying the PBS options inside the script is done via #PBS line prefix + an option. The batch script in the following example is called

#PBS -N myFirstJob
#PBS -l select=1:ncpus=4:mem=4gb:scratch_local=10gb
#PBS -l walltime=1:00:00 
#PBS -m ae
# The 4 lines above are options for scheduling system: job will run 1 hour at maximum, 1 machine with 4 processors + 4gb RAM memory + 10gb scratch memory are requested, email notification will be sent when the job aborts (a) or ends (e)

# define a DATADIR variable: directory where the input files are taken from and where output will be copied to
DATADIR=/storage/brno3-cerit/home/jenicek/test_directory # substitute username and path with your real username and path

# append a line to a file "jobs_info.txt" containing the ID of the job, the hostname of node it is run on and the path to a scratch directory
# this information helps to find a scratch directory in case the job fails and you need to remove the scratch directory manually 
echo "$PBS_JOBID is running on node `hostname -f` in a scratch directory $SCRATCHDIR" >> $DATADIR/jobs_info.txt

# load the Gaussian application module, version 03
module add g03

# test if scratch directory is set
# if scratch directory is not set, issue error message and exit
test -n "$SCRATCHDIR" || { echo >&2 "Variable SCRATCHDIR is not set!"; exit 1; }

# copy input file "" to scratch directory
# if the copy operation fails, issue error message and exit
cp $DATADIR/  $SCRATCHDIR || { echo >&2 "Error while copying input file(s)!"; exit 2; }

# move into scratch directory
cd $SCRATCHDIR

# run Gaussian 03 with as input and save the results into h2o.out file
# if the calculation ends with an error, issue error message and exit
g03 < >h2o.out || { echo >&2 "Calculation ended up erroneously (with a code $?) !!"; exit 3; }

# move the output to user's DATADIR or exit in case of failure
cp h2o.out $DATADIR/ || { echo >&2 "Result file(s) copying failed (with a code $?) !!"; exit 4; }

# clean the SCRATCH directory

You can then submit your job via qsub command.

jenicek@skirit:~$ qsub # submit a batch script named "" # job received ID 11733571 from a PBS server

If you want to specify the requested resources outside the batch script, move the PBS options to the submitting command, in the same way as when running the job interactively:

qsub -l select=1:ncpus=4:mem=4gb:scratch_local=10gb -l walltime=1:00:00

For a full description of PBS options, consult the section About scheduling system.
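A batch script that exits early after an error skips any clean-up written at its end. One possible way around this is the shell's trap command, sketched below under stated assumptions: a temporary directory stands in for the scratch directory (in a real job PBS sets SCRATCHDIR), run_job is a made-up function playing the role of the script, and false simulates a failed calculation.

```shell
# Stand-in for the scratch directory; in a real job PBS sets SCRATCHDIR.
SCRATCHDIR=$(mktemp -d)

# The parentheses run the function body in a subshell, so the EXIT trap
# fires when this "job" ends, whether it succeeds or fails.
run_job() (
    trap 'rm -rf "${SCRATCHDIR:?}"/*' EXIT
    cd "$SCRATCHDIR" || exit 1
    echo "temporary data" > tmp.dat
    false || exit 3        # simulate a failed calculation
)

status=0
run_job || status=$?
echo "job exit code: $status"
ls -A "$SCRATCHDIR" | wc -l   # number of entries left in the scratch
```

Note that a clean-up of this kind also deletes partial results, so it suits jobs whose useful output has already been copied elsewhere or is not worth keeping after a failure.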

PBS servers

There are three PBS servers that send the jobs to computational machines: (short name meta), (cerit) and The server stands a bit apart, as its machines are reserved for the Elixir group. Typically the user will come across the first two, meta and cerit.

Each of the PBS servers "sees" a different, mutually exclusive set of computing machines. Similarly, every frontend is connected to one of the three PBS servers. Consequently, the frontend from which a job is submitted determines which PBS server will manage the job and on which computational nodes it will run. In this sense the frontends are not equivalent.

The relationship between frontends, PBS servers and sets of computing machines is summarized in the table below. Note that the list of computing machines is not complete!

PBS server Frontend(s) Computing machines
All physical machines with ending are managed by cerit, and vice versa
All physical machines with ending are managed by elixir, and vice versa

Note: Every single job requires some resources on the scheduler side. In the case of very short jobs, the planning may take longer than the job itself. Therefore, if you need to submit many (more than a thousand) short (less than 10 minutes) jobs, we strongly recommend running them in batches submitted as one job. To prevent overloading of the PBS server, there is a quota of 10 000 jobs (running or queuing) per user.
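Running many short tasks inside one job usually comes down to a loop in the batch script. The sketch below only illustrates the pattern: run_one is a hypothetical placeholder for one short calculation (here it merely counts the characters of its input), and the .inp files are generated in a throw-away directory.

```shell
# Hypothetical stand-in for one short calculation: it counts the characters
# of its input file and stores the result next to it.
run_one() { wc -c < "$1" | tr -d ' ' > "$1.out"; }

# run_all processes every *.inp file in the given directory, one by one.
run_all() {
    for input in "$1"/*.inp; do
        [ -e "$input" ] || continue   # nothing to do if no file matches
        run_one "$input"
    done
}

# Demonstration with three tiny inputs in a throw-away directory.
work_dir=$(mktemp -d)
for i in 1 2 3; do echo "task $i" > "$work_dir/task$i.inp"; done
run_all "$work_dir"
ls "$work_dir"
```

In a real batch script the loop body would call the actual application, and the requested walltime would have to cover all tasks of the batch together.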

Track your job

Related topics
[List of your jobs]

The qsub command returns a job ID, which you can use to track or delete your job (e.g. 1733571). If you are logged on to a frontend managed by the same PBS server as the one which tracks the job, the number will suffice to identify the job. In other cases, you have to use the full job ID = number + the name of the PBS server (e.g.
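Since the full job ID is the number and the server name joined by a dot, the two parts can be separated with shell parameter expansion. A minimal sketch; "server.example.org" is a placeholder, not the name of a real PBS server:

```shell
# Hypothetical full job ID; the server name is a placeholder.
full_id="11733571.server.example.org"

job_number=${full_id%%.*}    # drop everything from the first dot onwards
server_name=${full_id#*.}    # drop everything up to the first dot

echo "$job_number"
echo "$server_name"
```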

You can track jobs via online application PBSmon:

It is also possible to track your job on the CLI via its ID (jobID). This is done with the qstat command. For example:

qstat -u jenicek # list all jobs of user "jenicek" running or queuing on the current PBS server
qstat -u jenicek # list all running or queuing jobs of user "jenicek" on all PBS servers
qstat -xu jenicek # list finished jobs for user "jenicek" 
qstat -f <jobID> # list details of the running or queuing job with a given jobID
qstat -xf <jobID> # list details of the finished job with a given jobID

After submitting a job and checking its status, you will typically see something like the following.

jenicek@skirit:~$ qstat -u jenicek # show the status of all running or queuing jobs submitted by user "jenicek"
                                                                 Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
11733550.meta-pbs.*  jenicek q_2h         --    1   1    1gb 00:05 Q   --

The letter under the header 'S' (status) gives the status of the job. The most common states are:

  • Q – queued
  • R – running
  • F – finished

Apart from these, you can quite often see jobs with the status "M" (moved) on the PBSmon job list. This means the job has been moved from one PBS Pro server to another.
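When the status letter is processed in a script, for example in a loop that waits for a job to finish, a small lookup keeps the meaning readable. A sketch covering only the states mentioned above (describe_state is a helper name made up for this illustration):

```shell
# describe_state translates a one-letter job state into words.
describe_state() {
    case "$1" in
        Q) echo "queued" ;;
        R) echo "running" ;;
        F) echo "finished" ;;
        M) echo "moved to another PBS server" ;;
        *) echo "other state ($1)" ;;
    esac
}

describe_state Q
describe_state M
```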

Tracking running jobs

Follow these steps if you would like to check the outputs of a job that has not finished yet:

1. Find out on which machine your job is running: -> "Show my jobs". You will see a page listing your jobs.

Clicking on the job's ID will open a page with full information about the job, including the hostname (the machine the job is running on) and the path to the scratch directory.

2. Log in to that machine from any frontend using the ssh command, e.g.


3. Navigate to the /var/spool/pbs/spool/ directory and examine the files:

  • $PBS_JOBID.OU for standard output (stdout – e.g. “”)
  • $PBS_JOBID.ER for standard error output (stderr – e.g. “”)

To watch a file continuously, you can also use the command tail -f.

$ tail -f # this command outputs appended data as the file grows

Get e-mail notification about a job

To keep track of submitted jobs, some users prefer to be notified by e-mail. To do this, add the following lines to your batch script:

#PBS -M # specify recipient's email; if this option is not used, your registered email will be used
#PBS -m abe # send mail when the job begins (b), aborts (a) or ends (e)


#PBS -m ae # send mail when aborts (a) or ends (e)

Caution: E-mail providers apply various filters to protect their users from spam. If you submit a lot of jobs with the #PBS -m abe option, the notifications may end up in your spam folder.

Examine job's standard output and standard error output

When a job is completed (no matter how), two files are created in the directory from which you have submitted the job. One represents standard output and the other one standard error output:

 <job_name>.o<jobID> # contains job's output data
 <job_name>.e<jobID> # contains job's standard error output

The standard error output contains all the error messages which occurred during the calculation. It is the first place to look if the job has failed, as the messages collected there are a valuable source of information about why it failed. If you contact user support to ask for help, do not remove the error file, but send it as an attachment together with your request.
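A quick first check after a job ends is whether its error file contains anything at all. The sketch below builds the two file names from the job name and number used earlier in this guide and tests the error file; it assumes that the ID in the file name is the numeric part of the job ID, and check_job is a helper name made up for this illustration.

```shell
job_name="myFirstJob"   # job name set with "#PBS -N" in the example script
job_number="11733571"   # job number from the qsub example

stdout_file="$job_name.o$job_number"
stderr_file="$job_name.e$job_number"

# check_job reports whether the error file in the given directory is empty.
check_job() {
    if [ -s "$1/$stderr_file" ]; then
        echo "errors reported"
    else
        echo "no errors reported"
    fi
}

# Demonstration in a throw-away directory with an empty error file.
demo_dir=$(mktemp -d)
: > "$demo_dir/$stderr_file"
check_job "$demo_dir"
```

A non-empty error file does not always mean the job failed, since some applications print harmless warnings to standard error, but it tells you where to start reading.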

You can copy these files to your personal computer (with the scp command) for further processing. You can also examine them directly on the CLI with any of the following commands.

$ cat # print whole content of file "" on standard output
$ cat | more # print whole content of file "" on standard output screenful-by-screenful (press spacebar to go to another screen)
$ vi # open file "" in text editor vi
$ less # open file "" read only

Job termination

It often happens that a user submits a job and only then realizes that something was wrong or missing in the input. A waiting or running job can be cancelled with the qdel command.

jenicek@skirit:~$ qdel # delete the job with full job ID ""

Clean the scratch manually

If a job ends with an error, the data are left in the scratch directory. You should always clean the scratch after all potentially useful data have been retrieved. To do so, you need to know the hostname of the machine where the job was run and the path to the scratch directory.

Note: Users' permissions allow them to remove only the content of the scratch directory, not the directory itself.

jenicek@skirit:~$ ssh # login to a hostname
jenicek@luna13:~$ cd /scratch/jenicek/ # enter the scratch directory
jenicek@luna13:/scratch/jenicek/$ rm -r * # remove all files and subdirectories
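Because rm -r * acts on whatever directory you happen to be in, and does not match hidden files, a more defensive variant names the directory explicitly. A minimal sketch, assuming the find command with the -mindepth and -maxdepth options commonly available on Linux; clean_dir is a helper name made up for this illustration and the demo runs on a throw-away directory:

```shell
# clean_dir removes the content of a directory but keeps the directory itself.
clean_dir() {
    [ -n "$1" ] && [ -d "$1" ] || { echo "not a directory: '$1'" >&2; return 1; }
    # List everything directly inside the directory, hidden files included,
    # and remove each entry recursively.
    find "$1" -mindepth 1 -maxdepth 1 -exec rm -r {} +
}

# Demonstration on a throw-away directory standing in for the scratch.
scratch_demo=$(mktemp -d)
mkdir "$scratch_demo/subdir"
touch "$scratch_demo/file.tmp" "$scratch_demo/.hidden"
clean_dir "$scratch_demo"
ls -A "$scratch_demo" | wc -l   # number of entries left after cleaning
```

Refusing an empty or non-existent argument guards against the classic mistake of running the removal in the wrong place.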

Log off

Logging off is simple.

jenicek@skirit:~$ exit
Connection to closed.

Logging off will terminate any currently running interactive jobs. Batch jobs do not depend on whether the user is logged on or off and will not be affected.