train_loop_tracking
- class aitoolbox.torchtrain.train_loop.train_loop_tracking.TrainLoopCheckpoint(model, train_loader, validation_loader, test_loader, optimizer, criterion, project_name, experiment_name, local_model_result_folder_path, hyperparams, cloud_save_mode='s3', bucket_name='model-result', cloud_dir_prefix='', source_dirs=(), rm_subopt_local_models=False, num_best_checkpoints_kept=2, iteration_save_freq=0, collate_batch_pred_fn=<function append_predictions>, pred_transform_fn=<function torch_cat_transf>, end_auto_eval=True, lazy_experiment_save=False, print_callbacks=False, gpu_mode='single', cuda_device_idx=None, use_amp=False)
Bases: TrainLoop
TrainLoop with the automatic model check-pointing at the end of each epoch
- Parameters:
model (TTModel or ModelWrap or TTDataParallel) – neural network model
train_loader (torch.utils.data.DataLoader) – data loader for train data set
validation_loader (torch.utils.data.DataLoader or None) – data loader for validation data set
test_loader (torch.utils.data.DataLoader or None) – data loader for test data set
optimizer (torch.optim.Optimizer or MultiOptimizer) – optimizer algorithm.
criterion (torch.nn.Module or MultiLoss or None) – criterion during the training procedure
project_name (str) – root name of the project
experiment_name (str) – name of the particular experiment
local_model_result_folder_path (str) – root local path where project folder will be created
hyperparams (dict) – hyper-parameters used in the experiment. When running the TrainLoop from a Jupyter notebook, manually specify the Python file path as the value of the experiment_file_path key so that the experiment file is copied to the experiment folder. When running the training directly from the terminal, the path is deduced automatically.
cloud_save_mode (str or None) – storage destination selector. For AWS S3: ‘s3’ / ‘aws_s3’ / ‘aws’. For Google Cloud Storage: ‘gcs’ / ‘google_storage’ / ‘google storage’. Any other value results in local storage to disk only.
bucket_name (str) – name of the bucket in the cloud storage
cloud_dir_prefix (str) – path to the folder inside the bucket where the experiments are going to be saved
source_dirs (list or tuple) – paths to the local folders with the source code files used in experiment
rm_subopt_local_models (bool or str) – if True, the deciding metric is set to ‘loss’. Give a string metric name to set it as the deciding metric for suboptimal model removal. If the metric name contains the substring ‘loss’, the metric is minimized; otherwise it is maximized.
num_best_checkpoints_kept (int) – number of best performing models which are kept when removing suboptimal model checkpoints
iteration_save_freq (int) – save the model checkpoint every specified number of training iterations
collate_batch_pred_fn (callable) – collate function transforming batch predictions as they come out from the model
pred_transform_fn (callable) – function transforming all the produced predictions after all the batches have been run through the model
end_auto_eval (bool or int) – used to optionally disable otherwise automatic end of epoch/training val/test loss calculations. This is useful when conducting very costly experiments to save on compute time. Specify either True/False boolean to always run or never run after each epoch or specify an int to execute only every specified number of epochs.
lazy_experiment_save (bool) – when in lazy mode experiment tracking components will create the experiment folder only after some training results are available (possibly at the end of the first epoch) instead of at the beginning of training.
print_callbacks (bool) – at the start of training print the list of registered callbacks which will be executed during the run of the train loop
gpu_mode (str) – GPU training mode selection. TrainLoop supports different GPU training modes by specifying one of the following:
'single': single GPU training
'dp': multi-GPU training via DataParallel
'ddp': multi-GPU training via DistributedDataParallel
cuda_device_idx (int or None) – CUDA device index used when training on multiple GPUs
use_amp (bool or dict) – use 16-bit Automatic Mixed Precision (AMP). To switch to AMP mode, either set this parameter to True to use the default AMP GradScaler initialization parameters, or provide custom AMP GradScaler initialization parameters as a dict.
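The interaction between rm_subopt_local_models and num_best_checkpoints_kept described above can be illustrated with a small standalone sketch. This is not the aitoolbox implementation: the helper names resolve_deciding_metric and checkpoints_to_keep and the history data are hypothetical, and the sketch only mirrors the documented rules (True selects ‘loss’, metric names containing ‘loss’ are minimized, everything else is maximized).

```python
def resolve_deciding_metric(rm_subopt_local_models):
    """Return (metric_name, minimize), or None when removal is disabled."""
    if rm_subopt_local_models is False:
        return None
    # True selects the default 'loss' metric; a string names the metric directly
    metric_name = "loss" if rm_subopt_local_models is True else rm_subopt_local_models
    # Metric names containing 'loss' are minimized, all others are maximized
    return metric_name, "loss" in metric_name


def checkpoints_to_keep(history, rm_subopt_local_models, num_best_checkpoints_kept=2):
    """history is a list of (checkpoint_name, {metric_name: value}) pairs."""
    resolved = resolve_deciding_metric(rm_subopt_local_models)
    if resolved is None:
        # Removal disabled: every checkpoint stays on disk
        return [name for name, _ in history]
    metric_name, minimize = resolved
    ranked = sorted(history, key=lambda item: item[1][metric_name], reverse=not minimize)
    return [name for name, _ in ranked[:num_best_checkpoints_kept]]


# Hypothetical per-epoch results, for illustration only
history = [
    ("epoch_0", {"loss": 0.9, "accuracy": 0.61}),
    ("epoch_1", {"loss": 0.5, "accuracy": 0.78}),
    ("epoch_2", {"loss": 0.7, "accuracy": 0.83}),
]
kept_by_loss = checkpoints_to_keep(history, True)
kept_by_accuracy = checkpoints_to_keep(history, "accuracy")
```

With the default num_best_checkpoints_kept=2, ranking by loss keeps the two lowest-loss checkpoints, while ranking by accuracy keeps the two highest-accuracy ones.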
- class aitoolbox.torchtrain.train_loop.train_loop_tracking.TrainLoopEndSave(model, train_loader, validation_loader, test_loader, optimizer, criterion, project_name, experiment_name, local_model_result_folder_path, hyperparams, val_result_package=None, test_result_package=None, cloud_save_mode='s3', bucket_name='model-result', cloud_dir_prefix='', source_dirs=(), collate_batch_pred_fn=<function append_predictions>, pred_transform_fn=<function torch_cat_transf>, end_auto_eval=True, lazy_experiment_save=False, print_callbacks=False, gpu_mode='single', cuda_device_idx=None, use_amp=False)
Bases: TrainLoop
TrainLoop with the model performance evaluation and final model saving at the end of the training process
- Parameters:
model (TTModel or ModelWrap or TTDataParallel) – neural network model
train_loader (torch.utils.data.DataLoader) – data loader for train data set
validation_loader (torch.utils.data.DataLoader or None) – data loader for validation data set
test_loader (torch.utils.data.DataLoader or None) – data loader for test data set
optimizer (torch.optim.Optimizer or MultiOptimizer) – optimizer algorithm.
criterion (torch.nn.Module or MultiLoss or None) – criterion during the training procedure
project_name (str) – root name of the project
experiment_name (str) – name of the particular experiment
local_model_result_folder_path (str) – root local path where project folder will be created
hyperparams (dict) – hyper-parameters used in the experiment. When running the TrainLoop from a Jupyter notebook, manually specify the Python file path as the value of the experiment_file_path key so that the experiment file is copied to the experiment folder. When running the training directly from the terminal, the path is deduced automatically.
val_result_package (AbstractResultPackage or None) – result package evaluated on validation data at the end of the training
test_result_package (AbstractResultPackage or None) – result package evaluated on test data at the end of the training
cloud_save_mode (str or None) – storage destination selector. For AWS S3: ‘s3’ / ‘aws_s3’ / ‘aws’. For Google Cloud Storage: ‘gcs’ / ‘google_storage’ / ‘google storage’. Any other value results in local storage to disk only.
bucket_name (str) – name of the bucket in the cloud storage
cloud_dir_prefix (str) – path to the folder inside the bucket where the experiments are going to be saved
source_dirs (list or tuple) – paths to the local folders with the source code files used in experiment
collate_batch_pred_fn (callable) – collate function transforming batch predictions as they come out from the model
pred_transform_fn (callable) – function transforming all the produced predictions after all the batches have been run through the model
end_auto_eval (bool or int) – used to optionally disable otherwise automatic end of epoch/training val/test loss calculations. This is useful when conducting very costly experiments to save on compute time. Specify either True/False boolean to always run or never run after each epoch or specify an int to execute only every specified number of epochs.
lazy_experiment_save (bool) – when in lazy mode experiment tracking components will create the experiment folder only after some training results are available (possibly at the end of the first epoch) instead of at the beginning of training.
print_callbacks (bool) – at the start of training print the list of registered callbacks which will be executed during the run of the train loop
gpu_mode (str) – GPU training mode selection. TrainLoop supports different GPU training modes by specifying one of the following:
'single': single GPU training
'dp': multi-GPU training via DataParallel
'ddp': multi-GPU training via DistributedDataParallel
cuda_device_idx (int or None) – CUDA device index used when training on multiple GPUs
use_amp (bool or dict) – use 16-bit Automatic Mixed Precision (AMP). To switch to AMP mode, either set this parameter to True to use the default AMP GradScaler initialization parameters, or provide custom AMP GradScaler initialization parameters as a dict.
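The cloud_save_mode values listed above map onto three storage destinations. The following minimal sketch shows that mapping; the function resolve_storage_destination and its return labels are hypothetical and are not part of aitoolbox, which may resolve the value differently internally.

```python
# Accepted selector strings, as listed in the cloud_save_mode description
S3_MODES = {"s3", "aws_s3", "aws"}
GCS_MODES = {"gcs", "google_storage", "google storage"}


def resolve_storage_destination(cloud_save_mode):
    """Map a cloud_save_mode value to 's3', 'gcs' or 'local' (illustrative labels)."""
    if cloud_save_mode in S3_MODES:
        return "s3"
    if cloud_save_mode in GCS_MODES:
        return "gcs"
    # Everything else, including None, falls back to local storage on disk
    return "local"


destinations = {
    mode: resolve_storage_destination(mode)
    for mode in ("aws", "google storage", "local", None)
}
```

Passing None or any unrecognized string is therefore a simple way to keep the results on the local disk only.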
- class aitoolbox.torchtrain.train_loop.train_loop_tracking.TrainLoopCheckpointEndSave(model, train_loader, validation_loader, test_loader, optimizer, criterion, project_name, experiment_name, local_model_result_folder_path, hyperparams, val_result_package=None, test_result_package=None, cloud_save_mode='s3', bucket_name='model-result', cloud_dir_prefix='', source_dirs=(), rm_subopt_local_models=False, num_best_checkpoints_kept=2, iteration_save_freq=0, collate_batch_pred_fn=<function append_predictions>, pred_transform_fn=<function torch_cat_transf>, end_auto_eval=True, lazy_experiment_save=False, print_callbacks=False, gpu_mode='single', cuda_device_idx=None, use_amp=False)
Bases: TrainLoopEndSave
TrainLoop with both model check-pointing at the end of each epoch and model performance reporting and model saving at the end of the training process
- Parameters:
model (TTModel or ModelWrap or TTDataParallel) – neural network model
train_loader (torch.utils.data.DataLoader) – data loader for train data set
validation_loader (torch.utils.data.DataLoader or None) – data loader for validation data set
test_loader (torch.utils.data.DataLoader or None) – data loader for test data set
optimizer (torch.optim.Optimizer or MultiOptimizer) – optimizer algorithm.
criterion (torch.nn.Module or MultiLoss or None) – criterion during the training procedure
project_name (str) – root name of the project
experiment_name (str) – name of the particular experiment
local_model_result_folder_path (str) – root local path where project folder will be created
hyperparams (dict) – hyper-parameters used in the experiment. When running the TrainLoop from a Jupyter notebook, manually specify the Python file path as the value of the experiment_file_path key so that the experiment file is copied to the experiment folder. When running the training directly from the terminal, the path is deduced automatically.
val_result_package (AbstractResultPackage or None) – result package evaluated on validation data at the end of the training
test_result_package (AbstractResultPackage or None) – result package evaluated on test data at the end of the training
cloud_save_mode (str or None) – storage destination selector. For AWS S3: ‘s3’ / ‘aws_s3’ / ‘aws’. For Google Cloud Storage: ‘gcs’ / ‘google_storage’ / ‘google storage’. Any other value results in local storage to disk only.
bucket_name (str) – name of the bucket in the cloud storage
cloud_dir_prefix (str) – path to the folder inside the bucket where the experiments are going to be saved
source_dirs (list or tuple) – paths to the local folders with the source code files used in experiment
rm_subopt_local_models (bool or str) – if True, the deciding metric is set to ‘loss’. Give a string metric name to set it as the deciding metric for suboptimal model removal. If the metric name contains the substring ‘loss’, the metric is minimized; otherwise it is maximized.
num_best_checkpoints_kept (int) – number of best performing models which are kept when removing suboptimal model checkpoints
iteration_save_freq (int) – save the model checkpoint every specified number of training iterations
collate_batch_pred_fn (callable) – collate function transforming batch predictions as they come out from the model
pred_transform_fn (callable) – function transforming all the produced predictions after all the batches have been run through the model
end_auto_eval (bool or int) – used to optionally disable otherwise automatic end of epoch/training val/test loss calculations. This is useful when conducting very costly experiments to save on compute time. Specify either True/False boolean to always run or never run after each epoch or specify an int to execute only every specified number of epochs.
lazy_experiment_save (bool) – when in lazy mode experiment tracking components will create the experiment folder only after some training results are available (possibly at the end of the first epoch) instead of at the beginning of training.
print_callbacks (bool) – at the start of training print the list of registered callbacks which will be executed during the run of the train loop
gpu_mode (str) – GPU training mode selection. TrainLoop supports different GPU training modes by specifying one of the following:
'single': single GPU training
'dp': multi-GPU training via DataParallel
'ddp': multi-GPU training via DistributedDataParallel
cuda_device_idx (int or None) – CUDA device index used when training on multiple GPUs
use_amp (bool or dict) – use 16-bit Automatic Mixed Precision (AMP). To switch to AMP mode, either set this parameter to True to use the default AMP GradScaler initialization parameters, or provide custom AMP GradScaler initialization parameters as a dict.
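The end_auto_eval parameter accepts either a bool or an int, which is easy to get wrong because bool is a subclass of int in Python. The sketch below shows one way such a value could be interpreted. The helper should_run_auto_eval is hypothetical, and the epoch counting convention (epochs numbered from 0, evaluation after every N-th completed epoch) is an assumption, not something the parameter description confirms.

```python
def should_run_auto_eval(end_auto_eval, epoch):
    """Decide whether the automatic evaluation runs after the given epoch."""
    # bool must be tested first: isinstance(True, int) is also True in Python
    if isinstance(end_auto_eval, bool):
        return end_auto_eval
    # Assumption: epochs are numbered from 0 and evaluation runs once every
    # `end_auto_eval` completed epochs
    return (epoch + 1) % end_auto_eval == 0


epochs = range(6)
always = [e for e in epochs if should_run_auto_eval(True, e)]
never = [e for e in epochs if should_run_auto_eval(False, e)]
every_third = [e for e in epochs if should_run_auto_eval(3, e)]
```

Under these assumptions, an int value of 3 triggers the evaluation after the third and sixth epochs only, which cuts the evaluation cost to a third for expensive experiments.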