Reasons for an unusual job ending

From MetaCentrum

WARNING: This page will be removed soon

Related topics
Requesting resources
About scheduling system

This page describes the possible reasons for an unusual job ending. The scheduling system kills a job when it exceeds its memory, processor, or time limits. If you can't find the answer to your question in this topic, read Requesting resources and About scheduling system.

What resources are supervised

  • walltime – how long the job runs. It cannot run longer than the time limit of the chosen queue or the value specified in the walltime parameter.
  • used CPUs – the job is killed when it computes on more processors than the scheduling system granted.
  • used memory – the memory occupied by the running job.
    • Note: qsub recognizes two memory-related parameters, mem and vmem. Use only the mem parameter for your jobs; vmem (= mem + swap) is intended for special purposes which you will most probably never need.

The default allocated resources, when no qsub parameters are specified, are:

queue normal
1 node
1 processor
400MB of memory
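
These defaults can also be requested explicitly; the following sketch assumes a hypothetical job script named myjob.sh:

```shell
# Explicit qsub request equivalent to the defaults above
# (myjob.sh is a hypothetical job script)
qsub -q normal -l nodes=1:ppn=1 -l mem=400mb myjob.sh
```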


You forgot to set some limit

If you don't specify the queue and the limits for memory, processors, and nodes, the job defaults to queue normal, 400MB of memory, 1 node, and 1 processor. More about qsub parameters can be found on the dedicated page. You should always set the number of nodes, the number of processors, the memory limit, and the scratch size.
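
A job script with every limit set explicitly might look like the following sketch; the resource values, the scratch parameter syntax, and the program name are illustrative only:

```shell
#!/bin/bash
# Hedged sketch of a job script with all limits set explicitly;
# values and the scratch parameter name are illustrative only.
#PBS -q normal
#PBS -l nodes=1:ppn=2
#PBS -l mem=2gb
#PBS -l scratch=4gb
#PBS -l walltime=24:00:00

./my_computation   # hypothetical program
```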

Tuning resource requirements with a test job

First run a single job of the given type as a test and check its requirements. Other jobs of the same type can then be run with tuned parameters. It is worth knowing how much memory and disk space a job needs before you run several jobs and the system kills them all. You can see the details of your jobs, including their specified and consumed resources, in PBSmon. Unfortunately, PBSmon shows the job as a whole, so it reports the consumed resources in total rather than per node. You can find more in section Requests don't count with number of machines. At the beginning it is good to demand more resources and then lower the requirements. Gathering the resources for the test job can take longer, but the subsequent jobs will be scheduled faster.

You can observe the computing needs of your job by starting it as an interactive job (see How to compute/Interactive jobs) on a machine with sufficient resources and watching how many threads run on which processors (htop). This can be done quite quickly; the job does not have to finish, and you can kill it after several seconds. The most frequent errors are incorrectly launched parallel jobs and wrong configuration of certain programs (Java, Matlab). You can find more in the documentation for the specific applications. The memory needs of the job can be found in PBSmon after the job ends, because they can vary during the computation.
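
The check described above might look as follows; the resource values are illustrative:

```shell
# Hedged example: start an interactive job (qsub -I) with illustrative
# limits, then watch the threads and per-CPU load
qsub -I -l nodes=1:ppn=4 -l mem=8gb -l walltime=1:00:00
# ...once the interactive shell opens on the granted node:
htop
```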

Requests don't count with number of machines

You have specified more nodes in qsub (e.g. -l nodes=2:ppn=3 -l mem=4gb) but forgot that the memory is allocated per machine. In other words, 8GB of memory and 6 processors are reserved for your job, but spread over 2 machines, so the job must not exceed 4GB of memory and 3 processors on any single machine. This principle also affects the license requirements of various software, such as Gaussian.
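
The arithmetic behind this example can be sketched as follows, using the per-node values from the qsub request above:

```shell
# Per-machine limits from -l nodes=2:ppn=3 -l mem=4gb
nodes=2
ppn=3              # processors granted on each node
mem_per_node_gb=4  # mem applies to each machine separately

# Totals across the whole job...
echo "total CPUs:   $((nodes * ppn))"                 # 6
echo "total memory: $((nodes * mem_per_node_gb)) GB"  # 8
# ...but on any single machine the job may still use
# at most 3 processors and 4 GB.
```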

The most frequently misused programs

Some programs, mainly Java and Matlab, are more likely than others to exceed the requested memory limits. Examples of the correct usage of these programs can be found here in the wiki.
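
For Java, for example, the JVM heap must be capped below the qsub memory request, since the JVM needs extra memory on top of the heap; the values and the jar name here are illustrative:

```shell
# Suppose the job was submitted with -l mem=4gb:
# cap the Java heap below that to leave room for JVM overhead
java -Xmx3g -jar myapp.jar   # myapp.jar is a hypothetical application
```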