Causes of unnatural end of job

From MetaCentrum

This page describes the possible causes of a job being ended prematurely by the scheduling system. The scheduling system kills a job when it exceeds its memory, processor or time limits. Exceeding the requested limits always leads to the job being killed, in order to protect other users from the excess demands of your job. To solve the actual problem, read the scheduling system description and the notes for your application once more.

What resources are supervised

  • walltime – how long the job has been running. It cannot run longer than the queue time limit or than the value specified in the walltime parameter.
  • used CPUs – the job is killed when it computes on more processors than were granted by the scheduling system.
  • used memory – the memory occupied by the running job.
    • Note: qsub recognizes two memory parameters, mem and vmem. For running your jobs please use only the mem parameter; vmem (= mem + swap) is only for special purposes that you will most probably not encounter.

When no qsub parameter is specified, the default allocated resources are:

queue default, which forwards into queue q_1d with a default walltime of 24 hours
1 node
1 processor
400MB of memory


You forgot to set some limit

If you don't specify the queue or the limits for memory, processors and nodes, the job gets the default settings: queue normal, 400MB of memory, 1 node and 1 processor. More about qsub parameters can be found on the dedicated page. You should always set the number of nodes, processors, memory limit and scratch size.
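As a minimal sketch of stating every limit explicitly, the hypothetical helper below only prints a qsub command line; it does not submit anything. The -l nodes=...:ppn=... and -l mem=... forms and the walltime parameter appear on this page; the scratch resource name, the walltime value format and the script name job.sh are assumptions, so check them against the qsub documentation before use.

```shell
# Hypothetical helper: prints a qsub command with all limits stated explicitly.
# Arguments: nodes, ppn, memory, walltime, scratch size, job script.
# The "scratch" resource name and the HH:MM:SS walltime format are assumptions.
build_qsub() {
    printf 'qsub -l nodes=%s:ppn=%s -l mem=%s -l walltime=%s -l scratch=%s %s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example values only: 1 node, 4 processors, 8 GB of memory, 24 hours, 10 GB of scratch.
build_qsub 1 4 8gb 24:00:00 10gb job.sh
```

Printing the command first makes it easy to review the requested limits before actually submitting the job.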

Tuning resource requirements on a test job

At first, run a single job of the given type as a test and check its needs. Other jobs of the same type can be run afterwards with tuned parameters. It is worth knowing how much memory and disk space the job needs before you submit several jobs and the system kills them all. You can see details of your jobs in PBSmon, including their specified and consumed resources. Unfortunately, PBSmon shows each job as a whole, so it reports the consumed resources in total rather than for each node used. You can find more in the section Requests don't count with number of machines. At the beginning it is good to request more resources and then lower the request. It can take longer to gather the required resources for the test job, but the other jobs will then be scheduled faster.

You can find out the computing needs of your job by starting an interactive job on a machine with sufficient resources and observing how many threads run on which processors (htop). This can be done quite quickly; the job does not have to finish, and you can kill it after several seconds. The most frequent errors are incorrectly launched parallel jobs and incorrect settings of certain programs (Java, Matlab). You can find more in the documentation for the specific applications. The memory needs of the job are best read from PBSmon after the job ends, because memory usage can vary during the computation.
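If htop is not convenient, the same check can be roughly scripted. The sketch below is an illustration, not part of the MetaCentrum tooling: thread_count and check_threads are hypothetical helper names, and thread_count assumes a Linux machine where /proc is available. Having more threads than requested processors is only a hint that the job may compute on more CPUs than it was granted, not proof.

```shell
# Hypothetical helper: number of threads of a running process (Linux only, reads /proc).
# $1 = process ID
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Hypothetical helper: compare an observed thread count with the requested processors.
# $1 = threads observed, $2 = processors requested (ppn)
check_threads() {
    if [ "$1" -gt "$2" ]; then
        echo "WARNING: $1 threads on $2 requested processors"
    else
        echo "OK: $1 threads on $2 requested processors"
    fi
}

# Example with made-up numbers: 8 threads observed, ppn=4 requested.
check_threads 8 4
```

Inside an interactive job the two helpers can be combined, for example check_threads "$(thread_count PID)" 4, where PID is the process ID of your program.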

Requests don't count with number of machines

You specified more than one node in qsub (e.g. -l nodes=2:ppn=3 -l mem=4gb) but forgot that memory is allocated per machine. In other words, 8GB of memory and 6 processors are reserved for your job in total, but they are spread over 2 machines, so the job cannot exceed 4GB of memory and 3 processors on any one machine. This principle is influenced by the license needs of various software, such as Gaussian.
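The arithmetic for the example request can be written out as a short sketch. The function name per_node_totals is hypothetical and serves only to illustrate the difference between the per-machine limits and the totals for the whole job.

```shell
# Hypothetical helper: per-machine limits versus totals for a multi-node request.
# $1 = nodes, $2 = ppn (processors per node), $3 = memory per node in GB
per_node_totals() {
    echo "per machine: $2 CPUs, $3 GB; whole job: $(( $1 * $2 )) CPUs, $(( $1 * $3 )) GB"
}

# The request from the text: -l nodes=2:ppn=3 -l mem=4gb
per_node_totals 2 3 4
```

The limits that the job must respect are the per-machine values, not the totals.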

The most frequently misused programs

Some programs are more likely than others to exceed the requested resources, especially the memory limit. These are mainly Java and Matlab. Examples of correct usage of these programs are available here in the wiki.