Trap command usage

From MetaCentrum

No single batch script is optimal for all types of calculations. The example script in Beginners_guide#Run_batch_jobs follows a rather data-protective policy:

  • clean the scratch directory only if there was no error,
  • in case the job was interrupted (either by user or by PBS), or ended erroneously, do not clean the scratch.

This policy may not suit all users, because it shifts the responsibility for cleaning the scratch onto the user whenever the job ends in a non-standard way. Scratch cleaning after an error can be automated with the trap command.

Trap the SIGTERM

When the job is killed either by PBS or by the user (qdel command), the following happens:

[Figure: Sigterm sigkill.jpg]

The batch script receives the SIGTERM signal. There is no way to distinguish whether the job was killed by PBS or by the user. On receiving SIGTERM, the running process may take a variety of actions: it may stop immediately, it may attempt to clean up and then stop, or it may do nothing. If the process keeps running, SIGTERM is followed several seconds later by SIGKILL (equivalent to kill -9), which stops it immediately.

The action taken upon receiving SIGTERM can be defined with the trap command. (SIGKILL cannot be trapped, ignored, or handled in any way.)

#!/bin/bash
trap 'clean_scratch' TERM # clean the scratch if you receive SIGTERM (kill by PBS or user qdel)
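
The behaviour of such a trap can be tried out without PBS. The following is a minimal sketch under stated assumptions: sleep 30 stands in for the real calculation, the hypothetical handler on_term only writes a message instead of calling clean_scratch, and a background subshell plays the role of the batch script. The calculation is started in the background and waited for, because bash runs a trap only after the current foreground command has finished, whereas the wait builtin is interrupted by a trapped signal right away.

```shell
#!/bin/bash
# Sketch only: 'sleep 30' and on_term are stand-ins, not part of any real job script.

log=$(mktemp)               # the handler writes its message here

(
    on_term() {
        echo "SIGTERM received, cleaning up" >> "$log"
        kill "$child" 2>/dev/null   # stop the calculation as well
        exit 143                    # 128 + 15, the usual exit status after SIGTERM
    }
    trap on_term TERM

    sleep 30 &              # run the "calculation" in the background ...
    child=$!
    wait "$child"           # ... so that the trap can interrupt the wait
) &
job=$!

sleep 1                     # give the subshell time to install the trap
kill -TERM "$job"           # simulate the SIGTERM sent by PBS or qdel
wait "$job"
status=$?

cat "$log"
echo "job ended with status $status"
```

In a real job the signal may also reach the calculation itself, in which case the trap runs as soon as that process ends; the background-and-wait form simply does not depend on it.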

Caution! Cleaning the scratch on SIGTERM is useful for getting rid of the mess left by user-killed jobs, but it may backfire when the job is killed by PBS, typically because the walltime limit was exceeded: clean_scratch then removes all potentially valuable checkpoint files. Adding

#!/bin/bash
trap 'cp all_checkpoint_files somewhere_safe/ ; clean_scratch' TERM # on SIGTERM, attempt to copy away potentially valuable files before cleaning scratch

can improve things, but it will clutter the user's home directory with unwanted files in other cases. Moreover, if the files are large or numerous, the copying may not finish before it is interrupted by SIGKILL, and the data then have to be retrieved from the scratch manually anyway.
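
One way to soften both drawbacks is to copy the checkpoints into a job-specific directory and to clean the scratch only if the copy succeeded. The sketch below is an illustration, not a recommended recipe: SCRATCH_DEMO, SAFE_DEMO and save_and_clean are hypothetical names, temporary directories stand in for the scratch and for permanent storage, rm -rf stands in for clean_scratch, and the .chk suffix for checkpoint files is an assumption. It does not help if SIGKILL arrives during the copy; in that case the handler is simply cut off and the scratch is left as it is.

```shell
#!/bin/bash
# Sketch only: all names below are hypothetical stand-ins.

SCRATCH_DEMO=$(mktemp -d)                                   # stand-in for $SCRATCHDIR
SAFE_DEMO=$(mktemp -d)/checkpoints_${PBS_JOBID:-nojobid}    # one directory per job
mkdir -p "$SAFE_DEMO"

# Pretend the calculation has produced two checkpoints and a temporary file.
touch "$SCRATCH_DEMO/step1.chk" "$SCRATCH_DEMO/step2.chk" "$SCRATCH_DEMO/tmp.dat"

save_and_clean() {
    # Copy the checkpoints first; clean the scratch only if the copy succeeded.
    if cp "$SCRATCH_DEMO"/*.chk "$SAFE_DEMO"/; then
        rm -rf "${SCRATCH_DEMO:?}"/*        # stand-in for clean_scratch
    else
        echo >&2 "Copy failed, scratch left in $SCRATCH_DEMO"
    fi
}

(   # the subshell plays the role of the batch script
    trap save_and_clean TERM
    kill -TERM "$BASHPID"                   # simulate the SIGTERM sent by PBS or qdel
)

ls "$SAFE_DEMO"
```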

Trap the EXIT

EXIT is not a signal, but for the purposes of the trap command it can be treated in the same way. EXIT happens when the script ends, either by executing its last line or via the exit command, as in the code snippet below:

#!/bin/bash
test -n "$SCRATCHDIR" || { echo >&2 "Variable SCRATCHDIR is not set!"; exit 1; }

If the trap for EXIT is set

#!/bin/bash
trap 'clean_scratch' EXIT # if the script exits, clean scratch

the scratch will be cleaned when the script reaches an exit command or, at the latest, when it runs to its end.
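
Both cases can be observed in a small experiment. In this sketch two subshells play the role of two batch scripts, and the trap only writes a line to a log file instead of calling clean_scratch; the log file and the messages are made up for the illustration.

```shell
#!/bin/bash
# Sketch only: the echo commands stand in for clean_scratch.

log=$(mktemp)

# Case 1: the "script" stops early via the exit command.
(
    trap 'echo "cleanup after early exit" >> "$log"' EXIT
    exit 1
    echo "never reached" >> "$log"
)

# Case 2: the "script" runs to its last line.
(
    trap 'echo "cleanup after normal end" >> "$log"' EXIT
    echo "work done" >> "$log"
)

cat "$log"
```

The log ends up with one cleanup line per case, and the line after exit 1 is never written.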

Caution! A trap on EXIT can backfire, too. Suppose the user sets the trap in order to clean up once the script has run to its end, and then adds a minor sanity check after the core calculation is done.

#!/bin/bash

...

trap 'clean_scratch' EXIT

...

./potentially_long_calculation_producing_result_files

test -d some-directory || { echo >&2 "Directory some-directory does not exist!"; exit 1; } # -d tests that the directory exists

cp result_files somewhere/
...

This, too, can lead to unintentional loss of results: if the check fails, exit 1 triggers the EXIT trap, and clean_scratch is executed before the result files are copied away.
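
One possible safeguard is to let the EXIT handler look at the exit status and clean the scratch only after a successful run. The following sketch assumes hypothetical names (SCRATCH_DEMO, RESULTS_DEMO, cleanup_on_success), uses temporary directories in place of the real scratch and result directories, and uses rm -rf in place of clean_scratch.

```shell
#!/bin/bash
# Sketch only: all names below are hypothetical stand-ins.

SCRATCH_DEMO=$(mktemp -d)       # stand-in for $SCRATCHDIR
RESULTS_DEMO=$(mktemp -d)       # stand-in for the place where results are stored

cleanup_on_success() {
    local status=$?             # exit status of the script when EXIT fires
    if [ "$status" -eq 0 ]; then
        rm -rf "${SCRATCH_DEMO:?}"/*        # stand-in for clean_scratch
    else
        echo >&2 "Exit status $status: scratch left in $SCRATCH_DEMO"
    fi
}

(   # the subshell plays the role of the batch script
    trap cleanup_on_success EXIT
    cd "$SCRATCH_DEMO" || exit 2
    echo "result" > result_file             # stand-in for the long calculation
    test -d no-such-directory || { echo >&2 "Directory no-such-directory does not exist!"; exit 1; }
    cp result_file "$RESULTS_DEMO"/         # not reached in this run
)
job_status=$?

ls "$SCRATCH_DEMO"              # result_file is still there
```

Here the failed check ends the script with status 1, so the handler keeps the scratch and the result file can still be retrieved by hand. Another option is to copy the results before running any check that may call exit.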