train_loop_tracking

class aitoolbox.torchtrain.train_loop.train_loop_tracking.TrainLoopCheckpoint(model, train_loader, validation_loader, test_loader, optimizer, criterion, project_name, experiment_name, local_model_result_folder_path, hyperparams, cloud_save_mode='s3', bucket_name='model-result', cloud_dir_prefix='', source_dirs=(), rm_subopt_local_models=False, num_best_checkpoints_kept=2, iteration_save_freq=0, collate_batch_pred_fn=<function append_predictions>, pred_transform_fn=<function torch_cat_transf>, end_auto_eval=True, lazy_experiment_save=False, gpu_mode='single', cuda_device_idx=None, use_amp=False)

Bases: aitoolbox.torchtrain.train_loop.train_loop.TrainLoop

TrainLoop with automatic model checkpointing at the end of each epoch

Parameters
  • model (TTModel or ModelWrap or TTDataParallel) – neural network model

  • train_loader (torch.utils.data.DataLoader) – data loader for train data set

  • validation_loader (torch.utils.data.DataLoader or None) – data loader for validation data set

  • test_loader (torch.utils.data.DataLoader or None) – data loader for test data set

  • optimizer (torch.optim.optimizer.Optimizer or MultiOptimizer) – optimizer algorithm

  • criterion (torch.nn.modules.loss._Loss or MultiLoss or None) – loss criterion used during training

  • project_name (str) – root name of the project

  • experiment_name (str) – name of the particular experiment

  • local_model_result_folder_path (str) – root local path where project folder will be created

  • hyperparams (dict) – hyper-parameters used in the experiment. When running the TrainLoop from a Jupyter notebook, the Python experiment file is copied to the experiment folder only if the user manually specifies its path as the value of the experiment_file_path key. When training is run directly from the terminal, the path is deduced automatically.

  • cloud_save_mode (str or None) – storage destination selector. For AWS S3 use 's3', 'aws_s3' or 'aws'. For Google Cloud Storage use 'gcs', 'google_storage' or 'google storage'. Any other value results in local storage to disk only.

  • bucket_name (str) – name of the bucket in the cloud storage

  • cloud_dir_prefix (str) – path to the folder inside the bucket where the experiments are going to be saved

  • source_dirs (list or tuple) – paths to the local folders with the source code files used in experiment

  • rm_subopt_local_models (bool or str) – if True, the deciding metric is set to 'loss'. Give a metric name as a string to use it as the deciding metric for suboptimal model removal. If the metric name contains the substring 'loss', the metric is minimized; otherwise it is maximized.

  • num_best_checkpoints_kept (int) – number of best-performing model checkpoints kept when suboptimal checkpoints are removed

  • iteration_save_freq (int) – number of training iterations between model checkpoint saves

  • collate_batch_pred_fn (callable) – collate function transforming batch predictions as they come out from the model

  • pred_transform_fn (callable) – function transforming all the produced predictions after all the batches have been run through the model

  • end_auto_eval (bool or int) – used to optionally disable the otherwise automatic end-of-epoch and end-of-training validation/test loss calculations. This is useful for saving compute time in very costly experiments. Specify True or False to always or never run the calculations after each epoch, or specify an int to run them only every specified number of epochs.

  • lazy_experiment_save (bool) – when in lazy mode, experiment tracking components create the experiment folder only after some training results are available (possibly at the end of the first epoch) instead of at the beginning of training.

  • gpu_mode (str) –

    GPU training mode selection. TrainLoop supports the following GPU training modes:

    • 'single': single GPU training

    • 'dp': multi-GPU training via DataParallel

    • 'ddp': multi-GPU training via DistributedDataParallel

  • cuda_device_idx (int or None) – CUDA device index used when training on multiple GPUs

  • use_amp (bool or dict) –

    use 16-bit Automatic Mixed Precision (AMP)

    To switch to AMP mode, either:

    • set this parameter to True to use the default torch.cuda.amp.GradScaler initialization parameters

    • pass a dict of custom torch.cuda.amp.GradScaler initialization parameters as this parameter
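
The way rm_subopt_local_models and num_best_checkpoints_kept interact can be illustrated with a minimal sketch. This is not AIToolbox's implementation: the function name select_checkpoints_to_keep, the (path, metric value) history format, and the file names are hypothetical and chosen only to show the rule stated in the parameter descriptions (metric names containing 'loss' are minimized, all others are maximized).

```python
def select_checkpoints_to_keep(history, metric_name="loss", num_best_checkpoints_kept=2):
    """Split checkpoints into those to keep and those to remove.

    history is a list of (checkpoint_path, metric_value) tuples
    (a hypothetical format used only for this illustration).
    """
    # Metric names containing 'loss' are minimized, everything else is maximized.
    minimize = "loss" in metric_name

    # Best checkpoints first: ascending for minimization, descending otherwise.
    ranked = sorted(history, key=lambda item: item[1], reverse=not minimize)

    keep = [path for path, _ in ranked[:num_best_checkpoints_kept]]
    remove = [path for path, _ in ranked[num_best_checkpoints_kept:]]
    return keep, remove


# Hypothetical per-epoch checkpoints with their validation metric values.
history = [("epoch_0.pth", 0.9), ("epoch_1.pth", 0.5), ("epoch_2.pth", 0.7)]
keep, remove = select_checkpoints_to_keep(history, metric_name="loss")
print(keep, remove)
```

With rm_subopt_local_models=True the deciding metric would be 'loss', which corresponds to the default metric_name in this sketch.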

class aitoolbox.torchtrain.train_loop.train_loop_tracking.TrainLoopEndSave(model, train_loader, validation_loader, test_loader, optimizer, criterion, project_name, experiment_name, local_model_result_folder_path, hyperparams, val_result_package=None, test_result_package=None, cloud_save_mode='s3', bucket_name='model-result', cloud_dir_prefix='', source_dirs=(), collate_batch_pred_fn=<function append_predictions>, pred_transform_fn=<function torch_cat_transf>, end_auto_eval=True, lazy_experiment_save=False, gpu_mode='single', cuda_device_idx=None, use_amp=False)

Bases: aitoolbox.torchtrain.train_loop.train_loop.TrainLoop

TrainLoop with model performance evaluation and final model saving at the end of the training process

Parameters
  • model (TTModel or ModelWrap or TTDataParallel) – neural network model

  • train_loader (torch.utils.data.DataLoader) – data loader for train data set

  • validation_loader (torch.utils.data.DataLoader or None) – data loader for validation data set

  • test_loader (torch.utils.data.DataLoader or None) – data loader for test data set

  • optimizer (torch.optim.optimizer.Optimizer or MultiOptimizer) – optimizer algorithm

  • criterion (torch.nn.modules.loss._Loss or MultiLoss or None) – loss criterion used during training

  • project_name (str) – root name of the project

  • experiment_name (str) – name of the particular experiment

  • local_model_result_folder_path (str) – root local path where project folder will be created

  • hyperparams (dict) – hyper-parameters used in the experiment. When running the TrainLoop from a Jupyter notebook, the Python experiment file is copied to the experiment folder only if the user manually specifies its path as the value of the experiment_file_path key. When training is run directly from the terminal, the path is deduced automatically.

  • val_result_package (AbstractResultPackage or None) – result package evaluated on validation data at the end of training

  • test_result_package (AbstractResultPackage or None) – result package evaluated on test data at the end of training

  • cloud_save_mode (str or None) – storage destination selector. For AWS S3 use 's3', 'aws_s3' or 'aws'. For Google Cloud Storage use 'gcs', 'google_storage' or 'google storage'. Any other value results in local storage to disk only.

  • bucket_name (str) – name of the bucket in the cloud storage

  • cloud_dir_prefix (str) – path to the folder inside the bucket where the experiments are going to be saved

  • source_dirs (list or tuple) – paths to the local folders with the source code files used in experiment

  • collate_batch_pred_fn (callable) – collate function transforming batch predictions as they come out from the model

  • pred_transform_fn (callable) – function transforming all the produced predictions after all the batches have been run through the model

  • end_auto_eval (bool or int) – used to optionally disable the otherwise automatic end-of-epoch and end-of-training validation/test loss calculations. This is useful for saving compute time in very costly experiments. Specify True or False to always or never run the calculations after each epoch, or specify an int to run them only every specified number of epochs.

  • lazy_experiment_save (bool) – when in lazy mode, experiment tracking components create the experiment folder only after some training results are available (possibly at the end of the first epoch) instead of at the beginning of training.

  • gpu_mode (str) –

    GPU training mode selection. TrainLoop supports the following GPU training modes:

    • 'single': single GPU training

    • 'dp': multi-GPU training via DataParallel

    • 'ddp': multi-GPU training via DistributedDataParallel

  • cuda_device_idx (int or None) – CUDA device index used when training on multiple GPUs

  • use_amp (bool or dict) –

    use 16-bit Automatic Mixed Precision (AMP)

    To switch to AMP mode, either:

    • set this parameter to True to use the default torch.cuda.amp.GradScaler initialization parameters

    • pass a dict of custom torch.cuda.amp.GradScaler initialization parameters as this parameter
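
The cloud_save_mode values listed above map onto three storage destinations. The sketch below restates that mapping as code. It is an illustration only, not the library's internal logic: the function name resolve_storage_backend and the returned labels 's3', 'gcs' and 'local' are hypothetical.

```python
def resolve_storage_backend(cloud_save_mode):
    """Map a cloud_save_mode value to a storage destination label.

    The accepted strings come from the parameter description; the returned
    labels are hypothetical names used only in this sketch.
    """
    if cloud_save_mode in ("s3", "aws_s3", "aws"):
        return "s3"
    if cloud_save_mode in ("gcs", "google_storage", "google storage"):
        return "gcs"
    # Any other value, including None, means results are only stored locally.
    return "local"


for mode in ("aws", "google_storage", None, "local"):
    print(repr(mode), "->", resolve_storage_backend(mode))
```

Note that an unrecognized string does not raise an error under this reading of the description: it falls back to local disk storage.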

check_if_result_packages_possible()

class aitoolbox.torchtrain.train_loop.train_loop_tracking.TrainLoopCheckpointEndSave(model, train_loader, validation_loader, test_loader, optimizer, criterion, project_name, experiment_name, local_model_result_folder_path, hyperparams, val_result_package=None, test_result_package=None, cloud_save_mode='s3', bucket_name='model-result', cloud_dir_prefix='', source_dirs=(), rm_subopt_local_models=False, num_best_checkpoints_kept=2, iteration_save_freq=0, collate_batch_pred_fn=<function append_predictions>, pred_transform_fn=<function torch_cat_transf>, end_auto_eval=True, lazy_experiment_save=False, gpu_mode='single', cuda_device_idx=None, use_amp=False)

Bases: aitoolbox.torchtrain.train_loop.train_loop_tracking.TrainLoopEndSave

TrainLoop combining model checkpointing at the end of each epoch with model performance reporting and model saving at the end of the training process

Parameters
  • model (TTModel or ModelWrap or TTDataParallel) – neural network model

  • train_loader (torch.utils.data.DataLoader) – data loader for train data set

  • validation_loader (torch.utils.data.DataLoader or None) – data loader for validation data set

  • test_loader (torch.utils.data.DataLoader or None) – data loader for test data set

  • optimizer (torch.optim.optimizer.Optimizer or MultiOptimizer) – optimizer algorithm

  • criterion (torch.nn.modules.loss._Loss or MultiLoss or None) – loss criterion used during training

  • project_name (str) – root name of the project

  • experiment_name (str) – name of the particular experiment

  • local_model_result_folder_path (str) – root local path where project folder will be created

  • hyperparams (dict) – hyper-parameters used in the experiment. When running the TrainLoop from a Jupyter notebook, the Python experiment file is copied to the experiment folder only if the user manually specifies its path as the value of the experiment_file_path key. When training is run directly from the terminal, the path is deduced automatically.

  • val_result_package (AbstractResultPackage or None) – result package evaluated on validation data at the end of training

  • test_result_package (AbstractResultPackage or None) – result package evaluated on test data at the end of training

  • cloud_save_mode (str or None) – storage destination selector. For AWS S3 use 's3', 'aws_s3' or 'aws'. For Google Cloud Storage use 'gcs', 'google_storage' or 'google storage'. Any other value results in local storage to disk only.

  • bucket_name (str) – name of the bucket in the cloud storage

  • cloud_dir_prefix (str) – path to the folder inside the bucket where the experiments are going to be saved

  • source_dirs (list or tuple) – paths to the local folders with the source code files used in experiment

  • rm_subopt_local_models (bool or str) – if True, the deciding metric is set to 'loss'. Give a metric name as a string to use it as the deciding metric for suboptimal model removal. If the metric name contains the substring 'loss', the metric is minimized; otherwise it is maximized.

  • num_best_checkpoints_kept (int) – number of best-performing model checkpoints kept when suboptimal checkpoints are removed

  • iteration_save_freq (int) – number of training iterations between model checkpoint saves

  • collate_batch_pred_fn (callable) – collate function transforming batch predictions as they come out from the model

  • pred_transform_fn (callable) – function transforming all the produced predictions after all the batches have been run through the model

  • end_auto_eval (bool or int) – used to optionally disable the otherwise automatic end-of-epoch and end-of-training validation/test loss calculations. This is useful for saving compute time in very costly experiments. Specify True or False to always or never run the calculations after each epoch, or specify an int to run them only every specified number of epochs.

  • lazy_experiment_save (bool) – when in lazy mode, experiment tracking components create the experiment folder only after some training results are available (possibly at the end of the first epoch) instead of at the beginning of training.

  • gpu_mode (str) –

    GPU training mode selection. TrainLoop supports the following GPU training modes:

    • 'single': single GPU training

    • 'dp': multi-GPU training via DataParallel

    • 'ddp': multi-GPU training via DistributedDataParallel

  • cuda_device_idx (int or None) – CUDA device index used when training on multiple GPUs

  • use_amp (bool or dict) –

    use 16-bit Automatic Mixed Precision (AMP)

    To switch to AMP mode, either:

    • set this parameter to True to use the default torch.cuda.amp.GradScaler initialization parameters

    • pass a dict of custom torch.cuda.amp.GradScaler initialization parameters as this parameter
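
Because end_auto_eval accepts either a bool or an int, the scheduling rule it describes may be easier to see as code. The sketch below is an assumption-laden illustration rather than the library's implementation: the function name should_run_auto_eval is hypothetical, epochs are assumed to be counted from 0, and the int case is assumed to fire whenever the epoch index is divisible by the given value. The library may count epochs differently.

```python
def should_run_auto_eval(end_auto_eval, epoch):
    """Decide whether automatic loss evaluation runs after the given epoch.

    Assumes zero-based epoch indices and an 'epoch % n == 0' rule for ints;
    both are assumptions made for this sketch.
    """
    # bool must be checked first because bool is a subclass of int in Python.
    if isinstance(end_auto_eval, bool):
        return end_auto_eval
    if isinstance(end_auto_eval, int) and end_auto_eval > 0:
        return epoch % end_auto_eval == 0
    return False


# Epochs (out of the first 7) on which evaluation would run for end_auto_eval=3.
print([epoch for epoch in range(7) if should_run_auto_eval(3, epoch)])
```

Checking for bool before int matters here: without it, True would be treated as the integer 1 and False as 0.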