FAQ/Grid computing

From MetaCentrum

Accessing the machines

The machine I am trying to log on to does not respond. What should I do?

There is always a small chance that the machine is down. Before starting a more systematic diagnosis, check the planned outages at [1].

Why can't I connect to the server skirit.ics.muni.cz using SSH or WinSCP?

If you use the PuTTY client, your configuration is not compatible with the settings of our SSH servers. Please change your client settings as follows: set the preferred SSH protocol version to 2 (instead of 1) using the item Preferred SSH protocol version on the Connection->SSH tab, and activate the item Attempt "keyboard-interactive" auth (SSH-2) on the Connection->SSH->Auth tab.

How do I disable the Message of the Day?

After you log in to one of our frontends, you are greeted by the system's Message of the Day with information about your last login, quotas, available CPUs, etc. If for some reason you do not want to see this message, you can turn it off by creating a .hushlogin file. Usage is very simple (do it directly in your home directory on the selected frontend):

$ touch .hushlogin 
$ logout 

After the next login, the Message of the Day will not be displayed. Removing .hushlogin from your home directory

$ rm .hushlogin

will restore the Message of the Day to its original state.

Job submission

General

I have a problem submitting a script to PBS: I get the error message "^M: command not found". What causes this problem and how can I solve it?

The problem arises because your PBS script was created on a different operating system (DOS/Windows), which terminates the lines of a text file differently from Unix/Linux: DOS/Windows ends each line with a carriage return followed by a line feed, and the shell reports the extra carriage return as ^M. You can convert your files from the DOS/Windows format before submission using the command dos2unix (fromdos).
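The conversion can be sketched as a small shell function. This is only an illustration: the function name fix_line_endings is made up for this example, and the tr fallback is an assumption for machines where dos2unix is not installed.

```shell
# Strip DOS/Windows carriage returns from a script before submitting it.
# fix_line_endings is a hypothetical helper name used only in this sketch.
fix_line_endings() {
    script="$1"
    cr=$(printf '\r')
    # Files saved with DOS/Windows line endings contain a carriage return
    # before each newline.
    if grep -q "$cr" "$script"; then
        if command -v dos2unix >/dev/null 2>&1; then
            dos2unix "$script"
        else
            # Fallback: delete every carriage return with tr, then write
            # the result back so the file keeps its permissions.
            tr -d '\r' < "$script" > "$script.tmp" &&
                cat "$script.tmp" > "$script" &&
                rm -f "$script.tmp"
        fi
    fi
}
```

For example, fix_line_endings job.sh (job.sh being a placeholder name) would clean the script in place, after which it can be submitted with qsub as usual.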

Imagine the following situation: I use stagein to tell a job where it should obtain its input (input files) and I submit the job. The job is queued, and in the meantime I change the input files in the directory. Will the job be computed with the input as it was before submission, or with the input stored at the stagein address when the job starts executing on a CPU?

Input files/directories are copied by stagein at the moment the job starts executing on a specific node (a standard data transfer using SCP takes place before/after job execution). That is, if you change the input while the job is queued, the job should be computed with the new input (provided, obviously, that the input files keep the same names).

I compiled my FORTRAN program using the Intel compiler. However, when it runs in a queue, the error message "IOSTAT = 29" appears. When the program is executed from the command line, there is no such error message. Is it necessary to specify paths for the input files when running in a queue?

At the beginning of the job script that you submit to PBS, it is a good idea to change the current directory according to your needs. Otherwise your job runs in the default directory on the cluster, not in the directory that was current when you ran qsub, and input files given by relative paths are not found (with the Intel compiler, IOSTAT = 29 indicates that a file was not found). To refer to the submission directory from the job script, you can use the variable PBS_O_WORKDIR, provided the job runs on the same cluster as qsub. It is also possible to have the input/output files copied automatically to/from a computational node through the qsub parameters -W stagein=... and -W stageout=.... Another option is the AFS directory (~/shared), which is slower (and particularly unsuitable for temporary and very large files) but visible from all MetaCentrum nodes.
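A minimal sketch of the start of such a job script is shown here. The fallback to the current directory is an addition for this example so that the script can also be tried outside PBS, and the program and file names in the last comment are placeholders.

```shell
#!/bin/sh
# Sketch of the beginning of a PBS job script.
# PBS sets PBS_O_WORKDIR to the directory from which qsub was run.
# Assumption for this example: fall back to the current directory when the
# variable is not set, e.g. when the script is tested from the command line.
workdir="${PBS_O_WORKDIR:-$PWD}"

# Stop immediately if the directory is not reachable from this node.
cd "$workdir" || exit 1
echo "Running in $(pwd)"

# From here on, relative paths to input files resolve as they did at
# submission time, for example (placeholder names):
# ./my_program < input.dat > output.dat
```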

I run 2 GPU jobs on the same node, 1 GPU each. How do I find out which GPU belongs to which job?

Type this command: /usr/sbin/list_cache arien gpu_allocation | grep $PBS_JOBID

My jobs are in the "M" state. What does it mean?

MetaCentrum uses three PBS servers (pbs.elixir-czech.cz, arien-pro.ics.muni.cz and wagap-pro.cerit-sc.cz). You can easily check the status of your running and waiting jobs on these servers with the following command (on any frontend):

qstat -u YOUR_USER_NAME @pbs.elixir-czech.cz @arien-pro.ics.muni.cz @wagap-pro.cerit-sc.cz

If you find a job in the "M" state (Moved), there is no reason to panic. The job has been moved from one PBS server to another that has more free resources. Typically, this happens when jobs from the PBS servers arien-pro.ics.muni.cz and wagap-pro.cerit-sc.cz are moved to the PBS server pbs.elixir-czech.cz.

What does the error message "qsub: Cannot find a valid kerberos 5 ticket: TGT ticket expired" mean?

A valid Kerberos ticket is required to submit a computational job to the PBS batch system. Normally a user obtains a Kerberos ticket automatically after a successful login to a MetaCentrum machine. A ticket is valid for 10 hours. If the ticket has expired and the user tries to submit a job, this error message appears. It can be solved by renewing the ticket with the command kauth (or kinit), or by logging in again. The current state of your Kerberos tickets can be checked using the command klist. More information can be found in the Kerberos documentation.
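The check and renewal can be combined in one step, as in the following sketch. It assumes the MIT Kerberos version of klist, whose -s option prints nothing and only sets the exit status; the function name ensure_ticket is made up for this example.

```shell
# Renew the Kerberos ticket only when the current one is missing or expired.
# ensure_ticket is a hypothetical helper name used only in this sketch.
ensure_ticket() {
    # Assumption: MIT Kerberos klist. With -s it runs silently and returns
    # a non-zero status if the credentials cache is unreadable or expired.
    if klist -s; then
        echo "Kerberos ticket is valid"
    else
        echo "Kerberos ticket missing or expired, requesting a new one"
        # kinit asks for your password interactively.
        kinit
    fi
}
```

Calling ensure_ticket just before qsub avoids the "TGT ticket expired" error during a long session on a frontend.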

I forgot to set some limits in qsub, what shall I do?

If you do not specify the queue or the memory, processor, and node limits, the defaults are used: the queue normal, 400 MB of memory, 1 node, and 1 processor.

Scheduling system

I want to run jobs explicitly on the machine eru1.ruk.cuni.cz or on aule.ics.muni.cz. Is there a way to tell the qsub command that I want one particular machine and no other?

Yes, use the command

qsub -l nodes=MACHINE_NAME:ppn=NUMBER_OF_PROCESSORS 

If you do not need a particular machine but any machine with a 64-bit system (AMD64 or EM64T architecture), it is better to request the property amd64 in the qsub command.
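As an illustration, Torque-style resource requests of the form shown above accept a node property appended after a further colon. The sketch below only builds the resource string. The helper name node_spec is made up, and whether a given MetaCentrum PBS server accepts this exact form is an assumption.

```shell
# Build a Torque-style node specification with an optional node property.
# node_spec is a hypothetical helper name used only in this sketch.
node_spec() {
    ppn="$1"        # number of processors per node
    property="$2"   # optional node property, e.g. amd64
    spec="nodes=1:ppn=$ppn"
    if [ -n "$property" ]; then
        spec="$spec:$property"
    fi
    echo "$spec"
}
```

A submission could then look like qsub -l "$(node_spec 2 amd64)" job.sh, where job.sh is a placeholder script name.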

How can I schedule jobs to run after a specific time? (e.g. if I want to prepare them in advance)

Add the parameter -a to the qsub command (-a date_time declares the time after which the job is eligible for execution).
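The date_time argument has the form [[[[CC]YY]MM]DD]hhmm[.SS]. The following sketch produces such a value for a time relative to now. It assumes GNU date (for the -d option), and the function name start_time is made up for this example.

```shell
# Print a timestamp in the CCYYMMDDhhmm form accepted by qsub -a.
# start_time is a hypothetical helper name used only in this sketch.
start_time() {
    # Assumption: GNU date, which understands relative times such as
    # "+2 hours" or "tomorrow 06:00" through the -d option.
    date -d "$1" +%Y%m%d%H%M
}
```

For example, qsub -a "$(start_time '+2 hours')" job.sh would make the job (job.sh is a placeholder) eligible for execution two hours from now.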

Applications

Java and Matlab erroneous memory usage

Some programs, mainly Java and Matlab, are more likely to exceed the requested memory limits. See the respective pages for examples of correct usage.

Gaussian

Memory allocation in the program Gaussian, version G03.E01, does not work. What should I do?

If you get the warning "buffer allocation failed in ntrext1." just after starting the program, use a directory on a scratch volume (identified by the $SCRATCHDIR environment variable) as the working directory instead of /home. (There is an error in the new version that occurs depending on the type of file system.)