Logging for fun and profit
Often, it is useful to log information about the execution of individual tasks. This information can be used
- to monitor the progress of the array job in real time,
- to check which tasks completed, and whether that was successful, or whether some failure occurred, and
- to obtain statistics on the execution time of tasks, or the execution time per node that was running the job's tasks.
Most queue systems and schedulers will provide this information, but often not in a convenient format.
alog is called at the start of a task, and at its end, and provides
centralized logging in a single file. This requires a POSIX compliant
shared file system when the job is running on multiple compute nodes.
Again, consider the fragment of the job script:
#!/bin/bash ... alpha=0.5 beta=-1.3 Rscript bootstrap.R $alpha $beta ...
aenv was not used here, which was done to stress the point
aenv are independent of one another, but obviously can
To enable logging, a call to
alog is added as the first and the last
executable line of the job script, i.e.,
#!/bin/bash ... alog --state start alpha=0.5 beta=-1.3 Rscript bootstrap.R $alpha $beta alog --state end --exit $?
Here we assume that the exit status of the last actual job command
Rscript in this example) is also the exit status of the task. The
Linux convention is that exit code 0 signifies success, any value between
from 1 to 255 indicates a failure. It is the value passed to the logger
--exit <status> option that will be recorded in the log file,
and that will determine whether a task is considered failed, or
The resulting log file is automatically created, and its name will be
<job-name>.log<job-id> where the job name and job identifier follow
the conventions of the queue system or scheduler used.
The log file will look like, e.g.,
1 started by r1i1n3 at 2016-09-02 11:47:45 2 started by r1i1n3 at 2016-09-02 11:47:45 3 started by r1i1n3 at 2016-09-02 11:47:46 2 failed by r1i1n3 at 2016-09-02 11:47:46: 1 3 completed by r1i1n3 at 2016-09-02 11:47:47
The format is
<task-id> <status> by <node-name> at <time-stamp>, followed
: <exit-status> for failed jobs. For this particular example, task
1 didn't complete, 2 failed, and 3 completed successfully.
alog takes a single option,
--conf <conf> to specify your own
Help on the command is printed using the