Best practices
Job walltime
Starting a job incurs overhead:
- Job scheduling time
- Environment setup time (loading modules, activating conda environments, etc.)
When using atools, this overhead is incurred for every job in the array, so it can add up to a significant amount of time. Each job should therefore run long enough to amortize the overhead. As a rule of thumb, we recommend that jobs run for at least 30 minutes. If your jobs are shorter than this, consider bundling multiple tasks into a single job.
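To make the rule of thumb concrete, the following sketch estimates how many tasks to bundle into one job and what share of the job's time goes to startup overhead. The function names and the example numbers (a 2-minute task, 60 seconds of overhead) are illustrative assumptions, not part of atools.

```python
import math


def tasks_per_job(task_seconds: float, min_job_seconds: float = 30 * 60) -> int:
    """Smallest number of tasks whose combined run time reaches min_job_seconds."""
    if task_seconds <= 0:
        raise ValueError("task_seconds must be positive")
    return max(1, math.ceil(min_job_seconds / task_seconds))


def overhead_fraction(task_seconds: float, overhead_seconds: float, n_tasks: int) -> float:
    """Fraction of a job's total time that is spent on startup overhead."""
    total = overhead_seconds + n_tasks * task_seconds
    return overhead_seconds / total


if __name__ == "__main__":
    task_seconds = 120.0     # hypothetical: each task runs for 2 minutes
    overhead_seconds = 60.0  # hypothetical: scheduling + environment setup

    n = tasks_per_job(task_seconds)
    print(f"bundle {n} tasks per job")
    print(f"overhead, 1 task per job: {overhead_fraction(task_seconds, overhead_seconds, 1):.1%}")
    print(f"overhead, {n} tasks per job: {overhead_fraction(task_seconds, overhead_seconds, n):.1%}")
```

Measure the actual overhead and task duration on your own cluster before choosing a bundle size; the 30-minute default above only encodes the rule of thumb.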
I/O performance
A shared (parallel) file system is a critical component of an HPC cluster. However, these file systems are typically optimized for large, sequential I/O operations, and can perform poorly with many small, random I/O operations.
When using atools, it is important to consider the I/O patterns of your tasks. The
load generated by many tasks performing small I/O operations can overwhelm the file system.
To mitigate this, consider the following strategies:
- Use appropriate file formats: Some file formats are more efficient for parallel
access than others. For example, using HDF5 or Parquet can improve performance
compared to plain text files.
- Use local storage: If your compute nodes have local storage (e.g., SSDs),
use it for temporary files during task execution. This can significantly reduce
the load on the shared file system.
- Batch I/O operations: Instead of writing many small files, consider
aggregating data into larger files.
- Optimize data access patterns: Try to access data in a sequential manner
as much as possible, rather than randomly.
- Limit the number of concurrent tasks: You can cap the number of jobs
that run at any given time, e.g., --array=1-200%5 ensures
that at most 5 jobs run concurrently.
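The local-storage and batching strategies above can be combined. The sketch below runs several tasks in one job, writes their results to a temporary directory on node-local storage, and copies a single aggregated file to the shared file system at the end. It is a minimal illustration under stated assumptions: the function name, the JSON Lines output format, and the use of the TMPDIR environment variable as the location of node-local storage are all hypothetical choices, so check your cluster's documentation for the actual scratch location.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def run_tasks_with_local_scratch(task_ids, compute, shared_dir, scratch_root=None):
    """Run tasks, buffer results on local storage, copy one file to shared storage."""
    task_ids = list(task_ids)
    if not task_ids:
        raise ValueError("task_ids must not be empty")

    shared_dir = Path(shared_dir)
    shared_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: TMPDIR points at node-local storage; fall back to the system default.
    scratch_root = scratch_root or os.environ.get("TMPDIR") or tempfile.gettempdir()

    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        local_file = Path(scratch) / "results.jsonl"

        # All small writes go to local storage, one line per task.
        with open(local_file, "w") as out:
            for task_id in task_ids:
                record = {"task_id": task_id, "result": compute(task_id)}
                out.write(json.dumps(record) + "\n")

        # A single large copy is the only write that reaches the shared file system.
        target = shared_dir / f"results_{task_ids[0]}-{task_ids[-1]}.jsonl"
        shutil.copy(local_file, target)

    return target
```

For example, calling run_tasks_with_local_scratch(range(1, 6), my_function, "results") would produce one file, results/results_1-5.jsonl, instead of five separate output files, where my_function stands in for your own per-task computation.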