Best practices

Job walltime

Starting a job incurs overhead:

- Job scheduling time
- Environment setup time (loading modules, activating conda environments, etc.)

When using atools, keep in mind that this overhead is incurred for every job in the array, so it can add up to a significant amount of time. Each job should therefore run long enough to amortize the overhead. As a rule of thumb, we recommend that jobs run for at least 30 minutes. If your jobs are shorter than this, consider bundling multiple tasks into a single job.
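As a rough illustration of the bundling arithmetic, the sketch below estimates how many tasks to group into one job so that each job reaches the 30-minute rule of thumb. The function names `tasks_per_job` and `bundle` are hypothetical helpers for this example and are not part of atools; the per-task runtime is assumed to be known from a trial run.

```python
import math


def tasks_per_job(task_minutes: float, min_job_minutes: float = 30.0) -> int:
    """Return how many tasks to bundle so one job runs at least min_job_minutes.

    task_minutes is an estimate of a single task's runtime (an assumption
    you have to supply, e.g. from a short test run).
    """
    if task_minutes <= 0:
        raise ValueError("task_minutes must be positive")
    # A task that already exceeds the threshold needs no bundling.
    return max(1, math.ceil(min_job_minutes / task_minutes))


def bundle(task_ids: list[int], size: int) -> list[list[int]]:
    """Split task_ids into consecutive groups of at most `size` tasks."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [task_ids[i:i + size] for i in range(0, len(task_ids), size)]


if __name__ == "__main__":
    size = tasks_per_job(task_minutes=2.0)
    groups = bundle(list(range(1, 201)), size)
    print(f"{size} tasks per job, {len(groups)} jobs in the array")
```

With 200 tasks of about 2 minutes each, this gives 15 tasks per job and an array of 14 jobs instead of 200, so the scheduling and setup overhead is paid 14 times rather than 200 times.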

I/O performance

A shared (parallel) file system is a critical component of an HPC cluster. However, these file systems are typically optimized for large, sequential I/O operations, and can perform poorly with many small, random I/O operations.

When using atools, it is important to consider the I/O patterns of your tasks. The load generated by many tasks performing small I/O operations can overwhelm the file system. To mitigate this, consider the following strategies:

- Use appropriate file formats: Some file formats are more efficient for parallel access than others. For example, using HDF5 or Parquet can improve performance compared to plain text files.
- Use local storage: If your compute nodes have local storage (e.g., SSDs), use it for temporary files during task execution. This can significantly reduce the load on the shared file system.
- Batch I/O operations: Instead of writing many small files, consider aggregating data into larger files.
- Optimize data access patterns: Try to access data in a sequential manner as much as possible, rather than randomly.
- Limit the number of concurrent tasks: You can limit the number of jobs that run at any given time by setting a limit, e.g., --array=1-200%5 ensures that at most 5 jobs run concurrently.
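Two of the strategies above, local storage and batched I/O, can be combined as in the following minimal sketch. It assumes that the node-local scratch directory is whatever Python's `tempfile` module resolves to (typically the `TMPDIR` environment variable, which may or may not point to local disk on your cluster). The function `run_bundle` and its `compute` argument are hypothetical names for this example, not part of atools.

```python
import csv
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional, Sequence


def run_bundle(
    task_ids: Sequence[int],
    shared_dir: str,
    compute: Callable[[int], object],
    scratch_root: Optional[str] = None,
) -> Path:
    """Run several tasks and write all results to one file.

    Results are first written to a scratch directory (assumed to be
    node-local) and copied to the shared file system in one operation.
    """
    if not task_ids:
        raise ValueError("task_ids must not be empty")
    target_dir = Path(shared_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # scratch_root=None lets tempfile pick its default (e.g. $TMPDIR).
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        name = f"results_{task_ids[0]}_{task_ids[-1]}.csv"
        local_file = Path(scratch) / name

        # One aggregated file instead of one small file per task.
        with local_file.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["task_id", "result"])
            for task_id in task_ids:
                writer.writerow([task_id, compute(task_id)])

        # A single sequential copy to the shared file system.
        target = target_dir / name
        shutil.copy2(local_file, target)
    return target
```

A call such as `run_bundle([1, 2, 3], "results", lambda task_id: task_id ** 2)` would leave a single file `results/results_1_3.csv` on the shared file system, rather than three separate small files. CSV is used here only to keep the sketch dependency-free; a format such as HDF5 or Parquet would follow the same pattern.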