Monitoring jobs and resuming tasks
Keeping track of which tasks have completed, successfully or not, and which
are still pending can be tedious. Resuming tasks that were not
completed, or that failed, requires a level of bookkeeping you may prefer
to avoid. arange is designed to help with both issues.
Note that for this to work, your job should do logging using
Monitoring a running job
Given either the CSV file or the task identifier range for a job, and its
log file as generated by
arange will provide statistics on the
progress of a running job, or a summary on a completed job.
If the log file's name is
bootstrap.pbs.log10493, and the job was based
on a CSV data file
data.csv, a summary can be obtained by
$ arange --data data.csv --log bootstrap.pbs.log10493 --summary
In case a job has been resumed, you should list all log files relevant to the job to get correct results.
Because arange parses the data file, it also has the
--sniff option to
specify the number of bytes to use to determine the dialect of the CSV
file. For files with many columns, the number of bytes the sniffer will
use to determine the file's structure and dialect should be increased
from the default value.
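As a rough illustration of raising the sniffer's sample size, the sketch below wraps the summary command in a small shell function. The function name summary_wide_csv and the value 4096 are assumptions made for this example only; they are not part of arange, and 4096 is not a documented default or recommendation.

```shell
# Hypothetical wrapper, not part of arange: print a job summary while
# letting the CSV sniffer look at more bytes than it does by default.
summary_wide_csv() {
    data_file=$1            # CSV data file with many columns
    log_file=$2             # log file of the job
    sniff_bytes=${3:-4096}  # illustrative value only, adjust to your file
    arange --data "$data_file" --log "$log_file" --summary --sniff "$sniff_bytes"
}
```

It could be called as summary_wide_csv data.csv bootstrap.pbs.log10493 8192, with the last argument chosen large enough to cover at least the header line of the file.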
For data files that have a single column only, the sniffer gets confused.
It can be switched off using the
arange works independently of
aenv, so it also supports
keeping track of general job arrays using the
$ arange -t 1-250 --log bootstrap.pbs.log10493 --summary
Sometimes it is useful to explicitly list the task identifiers of either
failed or completed jobs as task identifier ranges; this can be done by
--list_completed flags respectively.
arange's primary purpose is in fact helping to determine which task
identifiers should be redone when an array job did not complete, or when
some of its tasks failed. To get an identifier range of tasks that were
not completed, use
$ arange --data data.csv --log bootstrap.pbs.log10493
or, when not using
$ arange -t 1-250 --log bootstrap.pbs.log10493
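The identifier range printed by these commands is typically passed on to the scheduler when resubmitting. A minimal sketch of that step follows, assuming a Torque/PBS-style qsub that accepts the range through its -t option and a job script named bootstrap.pbs (inferred from the log file name, not stated above). The function resubmit_remaining is a hypothetical helper, not part of arange.

```shell
# Hypothetical helper, not part of arange: resubmit only the tasks that
# arange reports as still to do.
resubmit_remaining() {
    data_file=$1    # CSV data file, e.g. data.csv
    log_file=$2     # log file of the earlier run
    job_script=$3   # job script to resubmit; bootstrap.pbs is an assumed name
    # Ask arange for the range of task identifiers that were not completed.
    todo=$(arange --data "$data_file" --log "$log_file") || return 1
    if [ -z "$todo" ]; then
        echo "nothing left to do"
        return 0
    fi
    # Assumes qsub accepts the range in the form arange prints it.
    qsub -t "$todo" "$job_script"
}
```

For example, resubmit_remaining data.csv bootstrap.pbs.log10493 bootstrap.pbs would submit a new array job covering only the outstanding tasks, and would submit nothing if arange returns an empty range.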
If you want to include the tasks that failed, for instance when a bug that
caused this was fixed, simply add the
--redo flag when invoking
Help on the command is printed using the