FAQ/Job submission

WARNING: This page will be removed soon

I have a problem with submitting a script to the PBS system; I get the error message "^M: command not found". What is the cause of this problem and how can I solve it?

The described problem is caused by the fact that your script for the PBS system originated in a different operating system (DOS/Windows), which treats line endings of a text file differently than Unix/Linux. Convert your files from the DOS/Windows format before submission using the command dos2unix (or fromdos).
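
For example, a minimal sequence before submission (myscript.sh is an illustrative script name):

dos2unix myscript.sh
qsub myscript.sh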

Imagine the following situation: using stagein I tell the job where to obtain its input files, I submit the job and it is queued, and in the meantime I change the job's input files in that directory. Will the job be computed with the input as it was before submission, or with the input stored at the given stagein address at the moment the job starts executing on a CPU?

Input files/directories are copied by stagein at the moment the job starts executing on a specific node (it is a standard data transfer using SCP before/after the job execution). I.e., if you change the input while the job is queued, the job will be computed with the new input (with the obvious premise that the input files keep the same names).
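
For completeness, a sketch of the stagein syntax (the general form is execution_file@host:storage_file; the host and paths here are illustrative placeholders):

qsub -W stagein=input.dat@STORAGE_HOST:/path/to/input.dat script.sh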

I want to run jobs explicitly on the machine eru1.ruk.cuni.cz or on aule.ics.muni.cz. Is there a way to tell the qsub command that I want only one specific machine and no other?

Yes, there is, using the command

qsub -l nodes=MACHINE_NAME:ppn=NUMBER_OF_PROCESSORS 

If it would be more useful not to select a specific machine but an arbitrary machine with a 64-bit system (AMD64 or EM64T architecture), it is better to request the property amd64 in the qsub command.
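
For example, a sketch of such a request (the node and processor counts are illustrative):

qsub -l nodes=1:ppn=2:amd64 script.sh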

What does the error message "qsub: Cannot find a valid kerberos 5 ticket: TGT ticket expired" mean?

A valid Kerberos ticket is required for successful submission of a computational job to the PBS batch system. Normally a user obtains a Kerberos ticket automatically after a successful login to a MetaCentrum machine. A ticket is valid for 10 hours. If the Kerberos ticket expires and the user wants to submit a job into the system, the mentioned error message appears. This can be solved by renewing the ticket validity with the command kauth (or kinit), or by a new login. The current state of your Kerberos tickets can be checked using the command klist. More information can be found in the Kerberos documentation.
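
A typical sequence when the ticket has expired (a sketch using the commands mentioned above):

klist           # check the current tickets and their expiry times
kauth           # renew the ticket (or use kinit)
qsub script.sh  # the submission now succeeds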

Is there a way to make my jobs that dropped out of the queue not be computed again automatically? (I would prefer to submit them again myself.)

Add the parameter -r n to the qsub command.
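
For example (script.sh is an illustrative job script; -r n marks the job as not rerunnable):

qsub -r n script.sh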

How can I set jobs to run only after a specific time? (e.g. if I want to prepare them in advance)

Add the parameter -a to the qsub command (-a date_time declares the time after which the job is eligible for execution).
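
For example, a sketch (qsub interprets date_time in the format [[[[CC]YY]MM]DD]hhmm[.SS], so 12242200 means December 24, 22:00):

qsub -a 12242200 script.sh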

I compiled my program with the Intel Fortran compiler. However, when it runs in a queue, the error message "IOSTAT = 29" appears. When I run the program from the command line, there is no such error message. Is it necessary to specify paths to the input files when running in a queue?

At the beginning of the job script you submit to PBS, it is good to change the current directory according to your needs. Otherwise your job runs in the current directory on the cluster node, not in the directory that was current when you ran qsub. (To refer to that directory from the job script you can use the variable PBS_O_WORKDIR, provided the job runs on the same cluster as qsub.) It is also possible to have the input/output files copied automatically to/from a computational node through the parameters -W stagein=... and -W stageout=... of the qsub command. Another option is to use an AFS directory (~/shared), which is slower (and therefore not suitable for temporary or huge files) but is visible from all MetaCentrum nodes.
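
A minimal job-script sketch along these lines (the program and file names are hypothetical):

#!/bin/bash
# change to the directory that was current when qsub was run
cd "$PBS_O_WORKDIR"
./my_program < input.dat > output.dat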

According to the documentation, programs should run from the scratch directory. How can this be arranged if I do not have the right to write to the scratch directory?

What is meant is your own sub-directory on the scratch volume (i.e. /scratch/your_user_name/job_JOBID), identified by the $SCRATCHDIR environment variable. (This directory exists only on the computational nodes, not on the front-end node of a cluster (skirit.ics.muni.cz etc.).) The job script could look like this, e.g.:

cd $SCRATCHDIR
# copy the input files to the scratch directory
cp /storage/brno1/home/your_user_name/.../input* .
# the computation
# copy the results back to your home directory
cp output* /storage/brno1/home/your_user_name/.../
# cleaning:
rm -rf $SCRATCHDIR

I have scheduled a set of jobs in a queue and then exchanged the binary used to solve the jobs. Is this change taken into account for the already scheduled jobs? Does PBS try to store all the information needed to run a job?

PBS saves a copy of the job script and of the variable settings at submission time. When the job starts, the script is executed from that copy, but everything else (invoked programs, files for stagein, etc.) is used in its current state. When exchanging a binary or other file with a complicated structure, it is therefore important not to affect currently running jobs: it is best to rename the file, so that a running job can keep using the older version without danger, e.g.

cc -o program.version2 ....
mv program program.version1 && mv program.version2 program

or via symbolic links pointing to the particular version:

cc -o program.version1 ...
ln -s program.version1 program
# running
cc -o program.version2 ...
rm program && ln -s program.version2 program

I want to be informed via e-mail when my job terminates. What should I do?

Add the parameter -m ae to the qsub command. All details of the qsub command can be found in its manual page (man qsub).
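
For example (script.sh is an illustrative job script; a = mail on abort, e = mail on end):

qsub -m ae script.sh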

How can I find out on which machine my job was placed, and how can I log in to that machine so that I can fully check the running job?

You can see your jobs in the portal, where the machines and processors used by each job are listed. You can then log in to them via SSH (PuTTY) and check your jobs as you wish.
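
Alternatively, a sketch from the command line (replace JOB_ID and NODE_NAME with real values; the exec_host field of qstat -f lists the assigned nodes):

qstat -f JOB_ID | grep exec_host
ssh NODE_NAME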

I need the working directory to be deleted automatically if the computation is interrupted (I already have the proper command for deleting the working directory at the end of the script).

If the computation is interrupted because a limit was exceeded, you can handle it by making the job respond to the SIGTERM signal and remove the files in that case. Signals are caught in a shell script using the trap command, e.g.

trap "clean_scratch" SIGTERM

This also works when the computation is interrupted by the qdel command. Cases where the job crashes due to a serious system error (e.g. a power outage) cannot be handled automatically.
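
A slightly fuller sketch of how this can look inside a job script (clean_scratch is the command used above; copying partial results back is an illustrative assumption):

#!/bin/bash
# on SIGTERM (limit exceeded or qdel), salvage partial results and clean the scratch directory
trap 'cp partial_output* "$PBS_O_WORKDIR" 2>/dev/null; clean_scratch' SIGTERM
cd "$SCRATCHDIR"
# ... the computation ...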

If I run 2 GPU jobs on the same node, 1 GPU each, how do I recognize which GPU belongs to which job?

Type this command:

/usr/sbin/list_cache arien gpu_allocation | grep $PBS_JOBID

My jobs are in the "M" state. What does it mean?

MetaCentrum uses three PBS servers (pbs.elixir-czech.cz, arien-pro.ics.muni.cz and wagap-pro.cerit-sc.cz). You can easily check the status of your running and waiting jobs on these servers with the following command (on any front-end):

qstat -u YOUR_USER_NAME @pbs.elixir-czech.cz @arien-pro.ics.muni.cz @wagap-pro.cerit-sc.cz

If you find a job in the "M" state (Moved), there is no reason to panic. Your job has been moved from one PBS server to another one with more free resources. Typically you encounter this situation when jobs from the PBS servers arien-pro.ics.muni.cz and wagap-pro.cerit-sc.cz are moved to the PBS server pbs.elixir-czech.cz.