FAQ/Grid computing

MetaCentrum wiki is deprecated after March 2023
Dear users, due to the integration of MetaCentrum into https://www.e-infra.cz/en (the e-INFRA CZ service), the user documentation is changing its format and location.
These wiki pages will not be updated after the end of March 2023. They will, however, be kept for a few months for backward reference.
The new documentation resides at https://docs.metacentrum.cz.

Accessing the machines

The machine I am trying to log on to does not respond, what shall I do?

Try to log in to another frontend instead (see the list of frontends at Frontend). Check the planned outages at [1]. If there is no outage and you cannot log in to any other frontend, the problem is most probably on your side (firewall, expired account, etc.). Do not hesitate to contact user support.

Why is it not possible to connect to the server skirit.ics.muni.cz using SSH or WinSCP?

If you use the PuTTY client, your configuration may not be compatible with the settings of our SSH servers. Please change your client settings accordingly: in Connection -> SSH, set the preferred SSH protocol version to 2 (instead of 1), and in the Connection -> SSH -> Auth tab, enable the option Attempt "keyboard-interactive" auth (SSH-2).

How do I disable the Message of the Day?

After logging in to one of our frontends, you are greeted by the system's Message of the Day with information about your last login, quotas, available CPUs, etc. If for some reason you do not want to see this message, you can turn it off with a .hushlogin file. Usage is very simple (run this directly in your home directory on the selected frontend):

$ touch .hushlogin 
$ logout 

After the next login, the Message of the Day will not be displayed. Removing .hushlogin from your home directory

$ rm .hushlogin

will restore the Message of the Day to its original state.

I cannot access /storage directories despite being logged in on a frontend

A common problem is that users are not able to access the folders in /storage after they have been logged in on a frontend for a longer time (currently more than 10 hours). The reason is an expired Kerberos ticket. To renew the ticket, run the kinit command. In any case, it is good practice to log out after you finish work for the day.

$ kinit

How do I stop other users from seeing (reading, copying) files in my home directory?

All volumes are covered by the standard set of Unix permissions. The default permission on user directories and files is 755, i.e. rwxr-xr-x. This means other users can view and copy your files, but they can neither alter nor delete them. If you want to change this behavior, invoke the command

chmod 700 <directory>

To change the permissions on your home directory and on the computing/temporary (scratch) directory, use

chmod 700 /home/$USER
chmod 700 $SCRATCHDIR
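
To verify the result, list the directory itself with ls -ld; permissions starting with drwx------ mean that only you can access it:

$ ls -ld /home/$USER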

Jobs

General

I have a problem submitting a script to PBS; I get the error message "^M: command not found" or "$'\r': command not found"

The described problem is caused by the script for the PBS system having originated in a different operating system (DOS/Windows), which treats the line endings of a text file differently from Unix/Linux. Convert such files before submission using the command dos2unix (fromdos). To verify the presence of DOS/Windows line endings, use the file command, e.g. file /path/to/the/script.txt. If the output says with CRLF line terminators, the file has MS-DOS line endings.
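
For illustration, a typical check-and-convert session might look like this (script.sh is an example file name; the file output shown is indicative):

$ file script.sh
script.sh: ASCII text, with CRLF line terminators
$ dos2unix script.sh
$ file script.sh
script.sh: ASCII text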

If I submit a job with a certain input file(s) and meanwhile change the input(s), which version of the input file(s) will be used when the job finally starts?

Input files/directories are copied by the stage-in mechanism at the moment the job starts executing on a specific node (a standard data transfer using SCP before/after job execution). That is, if you change the input while the job is queued, the job will be computed with the new input (under the obvious premise that the input files keep the same names).

I was told to clean up the SCRATCHDIR after the job finishes, but I get "permission denied" for the operation

A common confusion is that users try to remove the whole scratch directory

rm -rf $SCRATCHDIR

while they should, and are allowed to, remove only the content of the scratch directory:

rm -rf $SCRATCHDIR/*

The empty scratch directory will be removed automatically after some time.

What does the error message "No Kerberos credentials found" mean?

A valid Kerberos ticket is required to successfully submit a computational job to the PBS batch system. Normally a user obtains a Kerberos ticket automatically after a successful login to a MetaCentrum machine. A ticket is valid for 10 hours. If the Kerberos ticket expires and the user tries to submit a job, the mentioned error message appears. This can be solved by renewing the ticket with the command kauth (or kinit), or by logging in again. The current state of your Kerberos tickets can be checked using the command klist. More information can be found in the Kerberos documentation.
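
A minimal check-and-renew sequence on a frontend:

$ klist    # list current Kerberos tickets and their expiry times
$ kinit    # obtain/renew a ticket; you will be asked for your password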

I want to prolong my job

Users can prolong running jobs to a certain extent. For detailed info see Prolong_walltime. Once you have run out of your CPU-time fund, contact user support.

PBS

My jobs are in the "M" state. What does it mean?

MetaCentrum uses three PBS servers (elixir-pbs.elixir-czech.cz, meta-pbs.metacentrum.cz and cerit-pbs.cerit-sc.cz). You can easily check the status of your running and waiting jobs on these servers with the following command (on any frontend):

qstat -u YOUR_USER_NAME @elixir-pbs.elixir-czech.cz @meta-pbs.metacentrum.cz @cerit-pbs.cerit-sc.cz

If you find a job in the "M" (Moved) state, it means the job was moved from one PBS server to another with more free resources. Typically the PBS servers meta-pbs.metacentrum.cz and cerit-pbs.cerit-sc.cz move their jobs to the server elixir-pbs.elixir-czech.cz.

I forgot to set some limits in qsub, what shall I do?

If you don't specify the queue and the limits on memory, processors and nodes, the defaults are 400 MB of memory, 24 hours of walltime and 1 CPU. The walltime limit can be extended (see I want to prolong my job above). Other resources cannot be changed after the job is submitted.
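
To avoid relying on the defaults, state the resources explicitly at submission time. A sketch (the chunk specification, walltime and script name are examples; adjust them to your job):

qsub -l select=1:ncpus=4:mem=8gb -l walltime=48:00:00 script.sh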

Is there a way to specify the queue and/or the PBS server for the job?

Yes, this is possible. See About scheduling system.
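For illustration, both can be combined in the -q option of qsub (the queue name and script name here are examples):

qsub -q default@meta-pbs.metacentrum.cz script.sh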

How can I set a job to run only after a specific time? (e.g. if I want to prepare jobs in advance)

Add the parameter -a to the qsub command (-a date_time declares the time after which the job is eligible for execution).
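
The date_time argument uses the standard qsub format [[[[CC]YY]MM]DD]hhmm[.SS]. For example, to make a job eligible to run no earlier than 6:00 on December 24 (the script name is an example):

qsub -a 12240600 script.sh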

Stuck job, qdel command does not work

To delete "stuck" jobs, type:

qdel -W force <job id>

A running job mysteriously disappeared from the qstat -u output

The job was most probably moved between PBS servers; qstat -u USERNAME lists only jobs run under the current PBS server. To list jobs on all servers, modify the command as follows:

qstat -u USERNAME @meta-pbs.metacentrum.cz @cerit-pbs.cerit-sc.cz @elixir-pbs.elixir-czech.cz

Working with data

Copying or moving large data (hundreds of GB) takes a very long time

Copying large data is not as straightforward an operation as copying a few files or a single directory. For a detailed guide on how to treat large data, see the Working with data page.

Applications

A Java application requires more CPUs than assigned

Even if you set the number of threads/CPUs to the number of CPUs allocated for the job, some Java applications may use more CPUs. This is because Java can start additional internal threads (e.g. for JIT compilation and garbage collection) on top of those explicitly used by the application, and by default it assumes no limits. If you encounter this behaviour, try adding the following parameters to the java command line (just after "java"):

-XX:-BackgroundCompilation -XX:ParallelGCThreads=$PBS_NCPUS

or

-XX:-BackgroundCompilation -XX:+UseSerialGC
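
For illustration, with a hypothetical application app.jar the full command line would read:

java -XX:-BackgroundCompilation -XX:ParallelGCThreads=$PBS_NCPUS -jar app.jar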

Java and Matlab erroneous memory usage

Some programs, mainly Java and Matlab, are more likely than others to exceed the requested memory limits. See the respective pages for examples of correct usage.

Missing libraries

Compatibility issues with some applications on Debian 10 (missing libraries) are continually being resolved by recompiling new SW modules. If you encounter a problem with your application, try adding the debian9-compat module at the beginning of the submission script.

 module add debian9-compat

Gaussian

Memory allocation in Gaussian version G03.E01 does not work. What should I do?

If you get the warning "buffer allocation failed in ntrext1." just after running the program, use a directory on a scratch volume (identified by the $SCRATCHDIR environment variable) as the working directory instead of /home (there is a bug in this version that manifests depending on the type of file system).
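
In a job script this amounts to switching to the scratch directory before starting Gaussian (the input file name is an example):

cd $SCRATCHDIR
g03 input.com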

I linked Q_LOG to $SCRATCHDIR ... Can I apply the above-described use of $SCRATCHDIR to a parallel job in the program xplor-nih?

According to the documentation (/afs/ics.muni.cz/software/xplor-nih-2.20/parallel.txt), running xplor in parallel mode requires a directory shared by all the nodes used. The best way is to use the shared scratch volume (available only on the mandos cluster; the dedicated space is identified by the $SCRATCHDIR environment variable), or to use an NFSv4 volume within the /storage directory. Note that QUEEN must be able to read the assignment from all nodes (e.g. it can be copied there read-only), but it is sufficient for the log subdirectory to be local on each node, because some parts of the job run in it as separately started xplor instances that communicate with the local QUEEN processes through files. The result of the whole job is written only by the main process.

I think there is a problem with the node orca14-2. Gaussian computations crash immediately after starting, with the error message (every time in a different part of the log file) "Erroneous write. Write 8192 instead of 12288."

The reason for this failure is that the disk is full of undeleted Gaussian auxiliary files. These files are remnants of your previous unsuccessful computations. You can avoid this situation by deleting all auxiliary files after a computation finishes. The easiest way is to use a dedicated directory for every single computation and remove the auxiliary files at the end:

export GAUSS_SCRDIR=$SCRATCHDIR      # Gaussian scratch (auxiliary) files go to the job's scratch directory
export GAUSS_ARCHDIR=$GAUSS_SCRDIR   # archive entries go to the same place
mkdir -p $GAUSS_SCRDIR               # make sure the directory exists
...
g03 ...
rm -rf $GAUSS_SCRDIR/*               # remove the auxiliary files (only the content, not the directory itself)