FAQ/Grid computing

Accessing the machines

The machine I am trying to log on to does not respond, what shall I do?

There is always a small chance that the machine is down. Before starting a more systematic diagnosis, check the planned outages at [1].

Why is it not possible to connect to the server skirit.ics.muni.cz using SSH or WinSCP?

If you use the PuTTY client, your configuration may not be compatible with the settings of our SSH servers. Please change your client settings accordingly: set the preferred SSH protocol version to 2 (instead of 1) via the item Preferred SSH protocol version on the Connection->SSH tab, and also activate the item Attempt "keyboard-interactive" auth (SSH-2) on the Connection->SSH->Auth tab.

How do I disable the Message of the Day?

After logging in to one of our frontends, you are greeted by the system's Message of the Day with information about your last login, quotas, available CPUs, etc. If you do not want to see this message for some reason, you can turn it off by creating a .hushlogin file. Usage is very simple (do it directly in your home directory on the selected frontend):

$ touch .hushlogin 
$ logout 

After the next login, the Message of the Day will not be displayed. Removing .hushlogin from your home directory

$ rm .hushlogin

will restore the Message of the Day.

I cannot access /storage directories despite being logged in on a frontend

A common problem is that users are unable to access folders in /storage while logged in on a frontend via ssh. The reason is a missing Kerberos ticket. To renew the ticket, run the kinit command.

$ kinit

Jobs

General

I have a problem with submitting a script to PBS; I get the error message "^M: command not found". What is the reason for this problem and how can I solve it?

The described problem stems from the fact that your PBS script originated on a different operating system (DOS/Windows), which treats line endings of text files differently from Unix/Linux. You can convert such files before submission using the command dos2unix (or fromdos).
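
For example, to convert and then resubmit a script (the file name job.sh is illustrative):

$ dos2unix job.sh    # rewrite CRLF (DOS/Windows) line endings as LF (Unix)
$ qsub job.sh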

Imagine the following situation: I tell a job via stagein where it should obtain its input files, I submit the job, the job is queued, and in the meantime I change the job's input files in that directory. Will the job be computed with the input as it was before submission, or with the input stored at the stagein address at the moment the job starts executing on a CPU?

Input files/directories are copied via stagein at the moment the job starts executing on a specific node (a standard SCP data transfer runs before/after job execution). That is, if you change the input while the job is queued, the job will be computed with the new input (assuming, obviously, that the input files keep the same names).

I compiled my FORTRAN program using the Intel compiler. However, when it runs in a queue, the error message "IOSTAT = 29" appears. When I execute the program on the command line there is no such error. Is it necessary to specify paths for the input files when running in a queue?

At the beginning of the job script you submit to PBS, it is a good idea to change the current directory as needed. Otherwise your job runs in the default working directory of the cluster, not in the directory that was current when you ran qsub. (To refer to that directory from the job script, you can use the PBS_O_WORKDIR variable, provided the job runs on the same cluster as qsub.) Input/output files can also be copied to/from the computational node automatically via the -W stagein=... and -W stageout=... parameters of qsub. Another option is to use an AFS directory (~/shared), which is slower (and therefore not well suited for temporary or huge files) but visible from all MetaCentrum nodes.
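
A minimal sketch of such a job script (the program and file names are illustrative; PBS_O_WORKDIR is valid only when the job runs on the same cluster as qsub):

#!/bin/bash
# change to the directory that was current when qsub was executed
cd $PBS_O_WORKDIR
# run the program; input/output paths are now relative to that directory
./my_program < input.dat > output.dat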

I run 2 GPU jobs on the same node, 1 GPU per each. How do I recognize which GPU belongs to the job?

Type this command:

/usr/sbin/list_cache arien gpu_allocation | grep $PBS_JOBID

My jobs are in the "M" state. What does it mean?

MetaCentrum uses three PBS servers (pbs.elixir-czech.cz, arien-pro.ics.muni.cz and wagap-pro.cerit-sc.cz), and you can easily check the status of your running and waiting jobs on these servers with the following command (on any frontend):

qstat -u YOUR_USER_NAME @pbs.elixir-czech.cz @arien-pro.ics.muni.cz @wagap-pro.cerit-sc.cz

If you find a job in the "M" (Moved) state, there is no reason to panic. Your job has been moved from one PBS server to another where more resources are free. Typically this happens when jobs are moved from the PBS servers arien-pro.ics.muni.cz and wagap-pro.cerit-sc.cz to the PBS server pbs.elixir-czech.cz.

What does the error message "qsub: Cannot find a valid kerberos 5 ticket: TGT ticket expired" mean?

A valid Kerberos ticket is required for successful submission of a computational job into the PBS batch system. A user commonly obtains a Kerberos ticket automatically after a successful login to a MetaCentrum machine. A ticket is valid for 10 hours. If the Kerberos ticket expires and the user tries to submit a job, the above error message appears. This can be solved by renewing the ticket with the kauth (or kinit) command, or by logging in again. The current state of your Kerberos tickets can be checked with the klist command. More information can be found in the Kerberos documentation.
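
For example:

$ klist    # list current tickets and their expiry times
$ kinit    # obtain a fresh ticket (you will be asked for your password)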

I forgot to set some limits in qsub, what shall I do?

If you do not specify the queue or the limits for memory, processors and nodes, the defaults are used: the queue normal, 400MB of memory, 1 node and 1 processor.
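
For illustration, a submission with these defaults spelled out explicitly would look like this (the script name is illustrative):

$ qsub -q normal -l nodes=1:ppn=1 -l mem=400mb job.sh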

Tuning up the resources needs on the test job

First run one job of the specific type as a test and check its needs; other jobs of the same type can then be run with tuned parameters. It is worth knowing how much memory and disk space the job needs before you run several jobs and the system kills them all. You can see the details of your jobs, including the requested and consumed resources, in PBSmon. Unfortunately, PBSmon shows a job as a whole, so it does not break down the consumed resources per node; you can find more in the section Requests do not account for the number of machines below. At the beginning it is good to request more resources and then lower the requirements. It may take longer to gather the resources for the test job, but the subsequent jobs will be scheduled faster.

You can inspect the computing needs of your job by submitting an interactive job on a machine with sufficient resources and observing how many threads run on which processors (htop). This can be done quite quickly; the job does not have to finish, you can kill it after a few seconds. The most frequent errors are incorrectly launched parallel jobs and incorrect configuration of certain programs (Java, Matlab). You can find more in the documentation for the specific applications. The memory consumption of the job is best checked in PBSmon after the job ends, because it can vary during the computation.
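
A quick way to do this is an interactive job (the resource values are illustrative):

$ qsub -I -l nodes=1:ppn=4 -l mem=4gb    # start an interactive session
$ htop                                   # then, inside the session, watch threads and memory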

Requests do not account for the number of machines

You have specified multiple nodes in qsub (i.e. -l nodes=2:ppn=3 -l mem=4gb) but forgot that memory is allocated per machine. In other words, 8GB of memory and 6 processors are reserved for your job, but spread over 2 machines, so the job cannot exceed 4GB of memory and 3 processors on any single machine. This principle also affects the licensing needs of various software, such as Gaussian.
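
For example, per the behaviour described above (the script name is illustrative):

$ qsub -l nodes=2:ppn=3 -l mem=4gb job.sh    # 4GB and 3 processors reserved on each of the 2 nodes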

Scheduling system

I want to run jobs explicitly on the machine eru1.ruk.cuni.cz or on aule.ics.muni.cz. Is there a way to tell the qsub command that I want one specific machine and no other?

Yes, use the command

qsub -l nodes=MACHINE_NAME:ppn=NUMBER_OF_PROCESSORS 

If you would rather not select a specific machine but any machine with a 64-bit system (AMD64 or EM64T architecture), it is better to request the amd64 property in the qsub command.
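
For example (the machine name and processor counts are illustrative):

$ qsub -l nodes=aule.ics.muni.cz:ppn=2 job.sh    # exactly this machine
$ qsub -l nodes=1:ppn=2:amd64 job.sh             # any machine with the amd64 property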

How can I set jobs to start only after a specific time? (e.g. if I want to prepare them in advance)

Add the -a parameter to the qsub command (-a date_time declares the time after which the job is eligible for execution).
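
For example, to make a job eligible no earlier than 23:00 on 15 June 2025 (the date is illustrative; qsub expects the format [[[[CC]YY]MM]DD]hhmm[.SS]):

$ qsub -a 202506152300 job.sh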

Applications

Java and Matlab erroneous memory usage

Some programs, mainly Java and Matlab, are prone to exceeding the requested memory limits. See the respective application pages for examples of correct usage.

Gaussian

Memory allocation in Gaussian version G03.E01 does not work. What should I do?

In case you get the warning "buffer allocation failed in ntrext1." just after starting the program, use a directory on a scratch volume (identified by the $SCRATCHDIR environment variable) as the working directory instead of /home (there is a bug in the new version that manifests depending on the type of file system).
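
A minimal sketch (the input/output file names are illustrative):

cd $SCRATCHDIR                  # use the scratch volume as the working directory, not /home
g03 < input.com > output.log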

queen

I cannot run queen properly on the testing example (see /home/pavelm/queen/test)

When I run queen locally in the directory /home/pavelm/queen/test (with the command "queen --Iuni example noe"), the job runs properly; when I try to queue the job, it fails with the error message: "ERROR - Dictionary /home/pavelm/queen/test/queen.conf could not be read. Script has been stopped." I tried to run the script both with psubmit and with qsub in the normal queue ("qsub -q normal run" or "psubmit normal run"). The script contains both "metamodule add queen" and the setting of the path to the configuration file in the environment variable QUEEN_CONF ("QUEEN_CONF=/home/pavelm/queen/test/queen.conf;export QUEEN_CONF"). The script to execute is the following:

metamodule add queen
Q_PROJECT=/home/pavelm/queen/test/
export Q_PROJECT
QUEEN_CONF=/home/pavelm/queen/test/queen.conf
export QUEEN_CONF
#export Q_PROJECT=/home/pavelm/queen/test/
#export QUEEN_CONF=/home/pavelm/queen/test/queen.conf
queen --Iuni example noe

Unfortunately, I cannot figure out why my configuration file cannot be read.

The path set in QUEEN_CONF must be accessible from the machine where PBS runs your job (in the case of parallel jobs, from all nodes), but a path like /home/pavelm/... is local to each cluster (it leads to different disks on different clusters). Either put queen.conf into your AFS directory (/afs/ics.muni.cz/home/pavelm, or /home/pavelm/shared), which is visible from all MetaCentrum machines, or limit the job to machines that have access to your directory, e.g. with the PBS option -l brno (-l ...:brno) if your /home is on skirit or perian.

...For the program to run efficiently, you have to place the "log" subdirectory of the assignment (item Q_LOG in queen.conf) on the fastest available storage -- e.g. on a scratch volume (in the directory dedicated to the job, identified by the $SCRATCHDIR environment variable) or on a /storage volume.

I pointed Q_LOG to $SCRATCHDIR ... Can I apply the above-described use of $SCRATCHDIR to a parallel job in the program xplor-nih?

According to the documentation (/afs/ics.muni.cz/software/xplor-nih-2.20/parallel.txt), you need a directory shared by all nodes when running xplor in parallel mode. The best option is to use the shared scratch volume (available only on the mandos cluster; the dedicated space is identified by the $SCRATCHDIR environment variable), or an NFSv4 volume within the /storage directory. Note that QUEEN must be able to read the assignment from all nodes (e.g. it can be copied there read-only), but it is sufficient, and preferable, for the log subdirectory to be local on each node, because parts of the job run in it as separately launched xplor instances that communicate with the local queen processes through files. The result of the whole job is written only by the main process.

Gaussian

I think there is a problem with the node orca14-2. Gaussian computations crash immediately after starting with the error message (each time in a different part of the log file): "Erroneous write. Write 8192 instead of 12288."

The reason for this failure is that the disk is full of undeleted Gaussian auxiliary files. These files are remains of your previous computations which were not successful. You can avoid this situation by deleting all auxiliary files after a computation finishes. The easiest way is to create a separate directory for every single computation and remove this directory at the end:

export GAUSS_SCRDIR=$SCRATCHDIR      # place Gaussian scratch files on the fast scratch volume
export GAUSS_ARCHDIR=$GAUSS_SCRDIR   # keep archive files in the same place
mkdir -p $GAUSS_SCRDIR               # make sure the directory exists
...
g03 ...
rm -rf $GAUSS_SCRDIR                 # clean up the auxiliary files when the computation ends