Training

Before getting started with this section, please make sure you have read the documentation on setup and preprocessing.

Training#

Training is fully configured via a YAML config file and a CLI powered by PyTorch Lightning. This allows you to control all aspects of the training from the config or directly via command line arguments.

The configuration is split into two parts. The base.yaml config contains model-independent information like the input file paths and batch size. This config is used by default for all trainings without you having to explicitly specify it. Meanwhile, the model configs (for example gnn.yaml) contain a full description of a specific model, including the list of input variables used. You can start a training for a given model by providing its config as an argument to the main.py python script, which is also exposed through the command salt.

salt fit --config configs/GN2.yaml

The subcommand fit specifies that you want to train the model, rather than evaluate it. It's possible to specify more than one configuration file, for example to override the values set in base.yaml. The CLI will merge them automatically.
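As a rough illustration of the merging behaviour described above, the sketch below recursively merges two nested dictionaries so that values from a later config override those from an earlier one. It is not the CLI's actual implementation, and the keys and values (`data.batch_size`, `data.num_workers`, `trainer.max_epochs`) are placeholders chosen for the example.

```python
from copy import deepcopy


def merge_configs(base: dict, override: dict) -> dict:
    """Return a new dict where values in `override` take precedence over `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are nested sections: merge them key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Scalars, lists and new keys simply replace the earlier value.
            merged[key] = deepcopy(value)
    return merged


# Placeholder values standing in for base.yaml and a user-supplied override file.
base = {"data": {"batch_size": 1000, "num_workers": 10}, "trainer": {"max_epochs": 40}}
override = {"data": {"batch_size": 500}}

print(merge_configs(base, override))
```

Only the overridden key changes; every other value from the earlier config is kept.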

Running a test training

To test the training script you can use the --trainer.fast_dev_run flag, which will run over a small number of training and validation batches and then exit.

salt fit --config configs/GN2.yaml --trainer.fast_dev_run 2

Logging and checkpointing are suppressed when using this flag.

Using the CLI#

You can also configure the training directly through CLI arguments. For a full list of available arguments run

salt fit --help

Choosing GPUs#

By default the config will try to use the first available GPU, but you can specify which ones to use with the --trainer.devices flag. See the PyTorch Lightning documentation to learn more about the different ways you can specify which GPUs to use.

Check GPU usage before starting training

You should check with nvidia-smi that any GPUs you use are not in use by some other user before starting training.
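If you prefer to script this check, the sketch below parses the CSV output of `nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits` and lists GPUs that look idle. The thresholds and the helper names are assumptions made for this example, and the sample output is made up.

```python
import subprocess


def parse_gpu_usage(csv_text: str) -> list[dict]:
    """Parse lines of 'index, memory.used, utilization.gpu' into dictionaries."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, mem_used, util = (field.strip() for field in line.split(","))
        gpus.append({"index": int(index), "memory_mib": int(mem_used), "util_pct": int(util)})
    return gpus


def idle_gpus(gpus: list[dict], max_memory_mib: int = 100, max_util_pct: int = 5) -> list[int]:
    """Return indices of GPUs below both (arbitrarily chosen) usage thresholds."""
    return [
        g["index"]
        for g in gpus
        if g["memory_mib"] <= max_memory_mib and g["util_pct"] <= max_util_pct
    ]


def query_gpus() -> list[dict]:
    """Query nvidia-smi; this only works on a machine with an NVIDIA driver."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_usage(out)


# Made-up output for a two-GPU machine where GPU 0 is busy.
sample = "0, 15234, 97\n1, 3, 0\n"
print(idle_gpus(parse_gpu_usage(sample)))  # [1]
```

The idle indices are candidates for the --trainer.devices flag.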

Resuming Training#

Model checkpoints are saved in timestamped directories under logs/. These directories also get a copy of the fully merged training config (config.yaml), and a copy of the umami scale dict. To resume a training, point to a previously saved config and a .ckpt checkpoint file by using the --ckpt_path argument.

salt fit --config logs/run/config.yaml --ckpt_path path/to/checkpoint.ckpt

The full training state, including the state of the optimiser, is resumed. The logs for the resumed training will be saved in a new directory, but the epoch count will continue from where it left off.

Note that the OneCycleLR learning rate scheduler sets a pre-determined total number of steps in the cycle. To resume a training beyond the original maximum number of epochs, e.g. by passing --trainer.max_epochs <value>, you also need to reset the scheduler by passing --model.lrs_config.last_epoch 0.

Reproducibility#

Reproducibility of tagger trainings is very important. The following features try to ensure reproducibility without too much overhead.

Commit Hashes & Tags#

When training, the framework will complain if you have any uncommitted changes. You can override this by setting the --force flag, but you should only do so if you are sure that preserving your training configuration is not necessary. The commit hash used for the training is saved in the metadata.yaml for the training run.
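For reference, a check of this kind can be sketched with two standard git commands, `git status --porcelain` (which prints nothing for a clean working tree) and `git rev-parse HEAD` (which prints the current commit hash). This is only an illustration and not the framework's own code; the helper names are hypothetical, and note that the porcelain output also lists untracked files, which the framework may or may not treat as uncommitted changes.

```python
import subprocess


def has_uncommitted_changes(porcelain_output: str) -> bool:
    """`git status --porcelain` prints one line per changed file, nothing if clean."""
    return bool(porcelain_output.strip())


def git(*args: str) -> str:
    """Run a git command in the current directory and return its stdout."""
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout


def current_state() -> dict:
    """Collect the commit hash and dirty flag for the current repository."""
    return {
        "commit": git("rev-parse", "HEAD").strip(),
        "dirty": has_uncommitted_changes(git("status", "--porcelain")),
    }
```

Calling `current_state()` inside a git repository returns the hash you would expect to find recorded for a run started from that checkout.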

Additionally, you can create and push a tag of the current state using the --tag argument. This will push a tag to your personal fork of the repo. The tag name is the same as the output directory name, which is generated automatically by the framework.

Using --force will prevent a tag from being created and pushed

Try to avoid using --force if you can, as it hampers reproducibility.

Random Seeds#

Training runs are reproducible thanks to the --seed_everything flag, which is already set for you in the base.yaml config. The flag seeds all random number generators used in the training. This means, for example, that weight initialisation and data shuffling happen in a deterministic way.

Stochastic operations can still lead to divergences between training runs

For more info take a look here.
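The effect of seeding can be seen in a small standalone example. The sketch below uses only Python's built-in `random` module, whereas a real training also seeds the NumPy and PyTorch generators; the function name is made up for the illustration.

```python
import random


def shuffled_indices(n: int, seed: int) -> list[int]:
    """Shuffle range(n) using a generator seeded with `seed`."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


# The same seed reproduces exactly the same order on every call.
first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)
print(first == second)  # True
```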

Dataloading#

Objects are loaded in weakly shuffled batches from the training file. This is much more efficient than randomly accessing individual entries, which would be prohibitively slow.
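One way to picture weak shuffling is sketched below. It assumes that "weakly shuffled" means contiguous slices of the file are read in a random order, with an optional shuffle inside each batch once it is in memory; the actual dataloader may differ in its details, and the function name is hypothetical.

```python
import random


def weakly_shuffled_batches(n_objects: int, batch_size: int, seed: int = 0):
    """Yield lists of indices: contiguous slices visited in random order."""
    rng = random.Random(seed)
    starts = list(range(0, n_objects, batch_size))
    rng.shuffle(starts)  # randomise the order in which slices are read
    for start in starts:
        # Each batch is one contiguous read, which is cheap for on-disk arrays.
        batch = list(range(start, min(start + batch_size, n_objects)))
        rng.shuffle(batch)  # shuffle within the batch once it is in memory
        yield batch


batches = list(weakly_shuffled_batches(n_objects=10, batch_size=4))
print(batches)
```

Every object is still seen exactly once per pass, but objects that are neighbours in the file always end up in the same batch, which is why the shuffling is only "weak".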

Some other dataloading considerations are discussed below.

Worker Counts#

During training, data is loaded using worker processes. The number of workers used is specified by data.num_workers. Increasing the worker count will speed up training until you max out your GPU's processing capabilities. Test different counts to find the optimal value, or just set this to the number of CPUs on your machine.

Maximum worker counts

You can find out the number of CPUs available on your machine by running

grep -c '^processor' /proc/cpuinfo

You should not use more workers than this. If you use too few or too many workers, you will see a warning at the start of training.
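The same number can be obtained from Python. The sketch below prefers `os.sched_getaffinity`, which on Linux reports only the CPUs the current process is allowed to run on (useful inside batch jobs that pin processes to a subset of cores), and falls back to `os.cpu_count()` elsewhere. The function name is just for illustration.

```python
import os


def available_cpus() -> int:
    """Number of CPUs this process may use, falling back to the machine total."""
    if hasattr(os, "sched_getaffinity"):
        # Respects CPU affinity masks, e.g. those set by taskset (Linux only).
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


print(available_cpus())
```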

Fast Disk Access#

Most HPC systems will have dedicated fast storage. Loading training data from these drives can significantly improve training times. To temporarily copy training files into a target directory before training, use the --data.move_files_temp=/temp/path/ flag.

If you have enough RAM, you can load the training data into shared memory before starting training by setting move_files_temp to a path under /dev/shm/<username>.

Ensure temporary files are removed

The code will try to remove the temporary files when the training is complete, but if the training is interrupted this may not happen. You should double check whether you need to manually remove the temporary files to avoid clogging up your system's RAM.
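A quick way to check for leftovers is to list what remains under the temporary directory. The helper below is a hypothetical sketch using only the standard library; it is demonstrated on a throwaway directory, and in practice you would pass the path you gave to move_files_temp.

```python
import tempfile
from pathlib import Path


def leftover_files(temp_dir: str) -> list[tuple[str, int]]:
    """Return (path, size in bytes) for every regular file under `temp_dir`."""
    root = Path(temp_dir)
    if not root.is_dir():
        return []
    return sorted((str(p), p.stat().st_size) for p in root.rglob("*") if p.is_file())


# Demonstration on a throwaway directory containing one 1 KiB placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.h5").write_bytes(b"\0" * 1024)
    found = leftover_files(tmp)
    print(found)
```

An empty list means there is nothing left to clean up.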

Batch systems#

HTCondor Batch#

Those at institutions with HTCondor managed GPU batch queues can submit training jobs using

python submit/submit_htcondor.py --config configs/GN2.yaml

You must pass the path to the configuration file via the --config argument. You can also specify the environment in which the job will run with -e / --environment. This depends on how you installed salt and can be either a local environment, a conda environment (which is assumed to be in the salt/conda directory) or a singularity container.

By default, all job submission files and batch logs will be stored in the condor directory, which is created upon job submission.

Running multiple trainings in parallel is possible by specifying different names for the jobs using the -t / --tag argument.

The job parameters such as memory requirements, number of GPUs and CPUs requested have to be modified in the file submit/submit_htcondor.py.

Slurm Batch#

Those at institutions with Slurm managed GPU batch queues can submit training jobs using a very similar script.

All options described above for HTCondor and more (CPUs, GPUs, etc) are available as command line arguments.

python submit/submit_slurm.py --config configs/GN2.yaml --tag test_salt --account MY-ACCOUNT --nodes 1 --gpus_per_node 2

Where arguments need to agree between Slurm and PyTorch Lightning, such as ntasks-per-node for Slurm and trainer.devices for Lightning, this is handled by the script.

Lightning has the ability to requeue a job if it is killed by Slurm for exceeding the system walltime. The training state is saved in a checkpoint and loaded when the new job begins. submit_slurm.py creates a single log directory holding the checkpoints for the original and any requeued jobs (in the example below, GN2_my_requeue_job).

python submit/submit_slurm.py --config configs/GN2.yaml --requeue --salt_log_dir=my_requeue_job --signal=SIGUSR1@90

There are many options for both Slurm and Salt, and not all of them are handled explicitly by the script. To access the remaining ones, prefix the argument name with slurm. (for Slurm) or config. (for Salt) and it will be converted into a valid argument. Note that in this case the name and value must be separated by "=" rather than a space.

python submit/submit_slurm.py --config configs/GN2.yaml --tag test_salt --account MY-ACCOUNT --nodes 1 --gpus_per_node 2 --config.trainer.max_epochs=1 --slurm.dependency=afterok:1111111 
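To make the prefix convention concrete, the sketch below shows one way such arguments could be separated into Slurm options and Salt options. It is not the logic used in submit/submit_slurm.py; the function name and the error handling are assumptions made for the example.

```python
def split_prefixed_args(args: list[str]) -> tuple[dict, dict]:
    """Split '--slurm.key=value' and '--config.key=value' into two dicts."""
    slurm_args, salt_args = {}, {}
    for arg in args:
        name, sep, value = arg.partition("=")
        if not sep:
            raise ValueError(f"expected '=' between name and value in {arg!r}")
        if name.startswith("--slurm."):
            slurm_args[name[len("--slurm."):]] = value
        elif name.startswith("--config."):
            salt_args[name[len("--config."):]] = value
        else:
            raise ValueError(f"unrecognised prefix in {arg!r}")
    return slurm_args, salt_args


slurm_args, salt_args = split_prefixed_args(
    ["--config.trainer.max_epochs=1", "--slurm.dependency=afterok:1111111"]
)
print(slurm_args)  # {'dependency': 'afterok:1111111'}
print(salt_args)   # {'trainer.max_epochs': '1'}
```

Requiring "=" keeps each name and value in a single token, so a generic parser of this kind never has to guess whether the next token is a value or another flag.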

The submit/submit_slurm.py script itself can also be modified if necessary.

There is also an older submit/submit_slurm.sh bash script that is kept around for compatibility. Users are strongly encouraged to use the python script.

Cleaning up after interruption

If training is interrupted, you can be left with floating worker processes on the node which can clog things up for other users. You should check for this after running training jobs which are cancelled or fail. To do so, srun into the affected node using

srun --pty --cpus-per-task 2 -w compute-gpu-0-3 -p GPU bash

and then kill any remaining running python processes using

pkill -u <username> -f salt -e

Troubleshooting#

If you encounter issues, as a first step you should try pulling the latest updates from main to see if your problem has been resolved. If you need more help you can post on mattermost.

Slow Training#

This section contains some suggestions for speeding up trainings. Some external advice can be found here and here.

If you are not producing a "final" version of your model (i.e. with maximum possible performance), but instead are running some studies, you should consider the following:

  • Limit the training statistics (e.g. 20M samples)
  • Reduce the number of epochs you train for (e.g. 20 epochs)
  • Remove any auxiliary tasks
  • Compile the model

Other things you can always do:

Confusing Errors#

You might see confusing/cryptic errors when running on the GPU, for example

../aten/src/ATen/native/cuda/NLLLoss2d.cu:103: nll_loss2d_forward_kernel: block: [377,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.

Often, if you instead run on the CPU you will get a much more helpful error message. Use --trainer.accelerator=cpu to run on the CPU instead of the GPU.
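The assertion in the example above, `t >= 0 && t < n_classes`, indicates that a class label fell outside the range expected by the loss. A simple check on the labels, sketched below in plain Python with made-up values and a hypothetical helper name, can confirm this before you start a training.

```python
def out_of_range_labels(labels: list[int], n_classes: int) -> list[tuple[int, int]]:
    """Return (position, label) for labels outside the valid range [0, n_classes)."""
    return [(i, t) for i, t in enumerate(labels) if not 0 <= t < n_classes]


# Made-up labels for a 3-class task: -1 and 3 are outside the valid range.
labels = [0, 2, -1, 1, 3]
print(out_of_range_labels(labels, n_classes=3))  # [(2, -1), (4, 3)]
```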

NaNs#

Salt will automatically check:

You may still encounter nan values in your outputs and losses. Here are some mitigation strategies you can try:

  • Make sure you have pulled the latest changes from main.
  • Make doubly sure that your inputs are finite, even after applying normalisation.
  • Ensure you don't have unexpected non-finite labels.
  • Try lowering your max learning rate in the lrs_config.
  • If you apply very large loss weights in your task configs, these might contribute to large gradients, so you can try removing any loss weights provided to your Tasks.
  • Check your training precision: if you have done the above and still have problems, you can try --trainer.precision=32 or --trainer.precision=bf16-mixed. See here for more info.
  • Apply gradient clipping to negate the effects of exploding gradients. See here for more info.
  • Auto detect gradient anomalies. See here for more info.
  • If you are running on multiple GPUs, try running on a single GPU with --trainer.devices=1
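For the checks on inputs and labels mentioned in the list above, a minimal sketch of counting non-finite values is shown below. It works on any iterable of floats and uses only the standard library; for real training files you would apply the same idea to the arrays you load, and the function name is only illustrative.

```python
import math


def count_non_finite(values) -> dict:
    """Count NaN and infinite entries in an iterable of floats."""
    counts = {"nan": 0, "inf": 0}
    for v in values:
        if math.isnan(v):
            counts["nan"] += 1
        elif math.isinf(v):
            counts["inf"] += 1
    return counts


# Made-up values containing one NaN and one infinity.
sample = [0.5, float("nan"), 1e3, float("-inf"), -2.0]
print(count_non_finite(sample))  # {'nan': 1, 'inf': 1}
```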

Last update: June 27, 2024
Created: October 7, 2022