Train Loop
TrainLoop
inside module aitoolbox.torchtrain.train_loop.train_loop
is at the core of and most important
component of the entire AIToolbox package.
Common to all available TrainLoops is the PyTorch model training loop engine which automatically handles the deep learning training process. As part of this it does the batch feeding of data into the model, calculating loss and updating parameters for a specified number of epochs.
torchtrain
and by extension TrainLoop
has been designed with the ease of use in mind. One of the main design principles was to keep as much
training code as possible exactly the same as would be used in normally PyTorch. Consequently, the user can define
the dataset, dataloader and models in exactly the same way as it would be done when training directly with core PyTorch.
Having no need to modify the definitions of common PyTorch training components in order to use torchtrain makes it
very user-friendly and allows the user to apply torchtrain directly to projects which initially weren’t even coded with
AIToolbox in mind.
To train the model, all the user has to do is provide the TrainLoop with the model, train / validation / test dataloaders, loss function and the optimizer. That’s it.
Once the TrainLoop with all the necessary components has been created all that’s left is to start training the model.
Common to all the available TrainLoops is the fit()
method which
initiates the training process. The fit()
method
will train the provided model on the given training dataset in the training dataloader for the specified number of epochs.
Note
In order to use the aitoolbox.torchtrain.train_loop
the user has to define their models as a
aitoolbox.torchtrain.model.TTModel
which is a slightly modified AIToolbox specific variation of
the core PyTorch torch.nn.Module
. Please have a look at the TTModel section of the documentation
in order to learn how to define your TTModels compatible with TrainLoop supported training.
TrainLoop Variations
aitoolbox.torchtrain.train_loop
module consists of submodules
train_loop
and train_loop_tracking
which implement four different TrainLoop
variations:
The above listed TrainLoop options can be distinguished based on the varying extent of the automatic experiment tracking they do on top of the core training loop functionality. The available TrainLoops follow this naming convention:
name includes
Checkpoint
keyword: the TrainLoop will automatically save the model after each training epochname includes
EndSave
keyword: the TrainLoop will automatically evaluate final model performance and save the final model at the end of the training
TrainLoop
The simplest TrainLoop version which only performs the model training and does no experiment tracking and performance evaluation.
The API can be found in: TrainLoop
.
Example of the TrainLoop
used to train the model:
from aitoolbox.torchtrain.train_loop import *
model = CNNModel() # TTModel based neural model
train_loader = DataLoader(...)
val_loader = DataLoader(...)
test_loader = None
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
criterion = nn.NLLLoss()
tl = TrainLoop(
model,
train_loader, val_loader, test_loader,
optimizer, criterion
)
model = tl.fit(num_epochs=10)
TrainLoopCheckpoint
Same training process as in TrainLoop with additional automatic model checkpointing (saving) after every epoch. Model saving can be done only to the local disk or also to the cloud storage such as AWS S3.
The API can be found in: TrainLoopCheckpoint
.
from aitoolbox.torchtrain.train_loop import *
from aitoolbox.experiment.result_package.basic_packages import ClassificationResultPackage
hyperparams = {
'lr': 0.001,
'betas': (0.9, 0.999)
}
model = CNNModel() # TTModel based neural model
train_loader = DataLoader(...)
val_loader = DataLoader(...)
test_loader = DataLoader(...)
optimizer = optim.Adam(model.parameters(), lr=hyperparams['lr'], betas=hyperparams['betas'])
criterion = nn.NLLLoss()
tl = TrainLoopCheckpoint(
model,
train_loader, val_loader, test_loader,
optimizer, criterion,
project_name='train_loop_examples', experiment_name='TrainLoopCheckpoint_example',
local_model_result_folder_path='results_dir',
hyperparams=hyperparams,
cloud_save_mode='s3', bucket_name='cloud_results' # bucket_name should be set to the bucket on your S3
)
model = tl.fit(num_epochs=10)
TrainLoopEndSave
Same training process as in TrainLoop with additional automatic model checkpointing (saving) and model performance evaluation at the end of the training process. This way the TrainLoop ensures experiment tracking a the end of the training. Model and experiment results saving can be done only to the local disk or also to the cloud storage such as AWS S3.
The API can be found in: TrainLoopEndSave
.
For information about the ResultPackage
used in this example, have a look at the Result Package
section.
from aitoolbox.torchtrain.train_loop import *
from aitoolbox.experiment.result_package.basic_packages import ClassificationResultPackage
hyperparams = {
'lr': 0.001,
'betas': (0.9, 0.999)
}
model = CNNModel() # TTModel based neural model
train_loader = DataLoader(...)
val_loader = DataLoader(...)
test_loader = DataLoader(...)
optimizer = optim.Adam(model.parameters(), lr=hyperparams['lr'], betas=hyperparams['betas'])
criterion = nn.NLLLoss()
tl = TrainLoopEndSave(
model,
train_loader, val_loader, test_loader,
optimizer, criterion,
project_name='train_loop_examples', experiment_name='TrainLoopEndSave_example',
local_model_result_folder_path='results_dir',
hyperparams=hyperparams,
val_result_package=ClassificationResultPackage(),
test_result_package=ClassificationResultPackage(),
cloud_save_mode='s3', bucket_name='cloud_results' # bucket_name should be set to the bucket on your S3
)
model = tl.fit(num_epochs=10)
TrainLoopCheckpointEndSave
For the most complete experiment tracking it is recommended to use the this TrainLoop option. At its core it is the same training process as in TrainLoop with additional automatic model checkpointing (saving) after each epoch as well as automatic model checkpointing and model performance evaluation at the end of the training process. This way the TrainLoop ensures full experiment tracking with the maximum extent. Model and experiment results saving can be done only to the local disk or also to the cloud storage such as AWS S3.
The API can be found in: TrainLoopCheckpointEndSave
.
For information about the ResultPackage
used in this example, have a look at the Result Package
section.
For a full working example of the TrainLoopCheckpointEndSave
training, check out this
TrainLoopCheckpointEndSave example training script.
from aitoolbox.torchtrain.train_loop import *
from aitoolbox.experiment.result_package.basic_packages import ClassificationResultPackage
hyperparams = {
'lr': 0.001,
'betas': (0.9, 0.999)
}
model = CNNModel() # TTModel based neural model
train_loader = DataLoader(...)
val_loader = DataLoader(...)
test_loader = DataLoader(...)
optimizer = optim.Adam(model.parameters(), lr=hyperparams['lr'], betas=hyperparams['betas'])
criterion = nn.NLLLoss()
tl = TrainLoopCheckpointEndSave(
model,
train_loader, val_loader, test_loader,
optimizer, criterion,
project_name='train_loop_examples', experiment_name='TrainLoopCheckpointEndSave_example',
local_model_result_folder_path='results_dir',
hyperparams=hyperparams,
val_result_package=ClassificationResultPackage(),
test_result_package=ClassificationResultPackage(),
cloud_save_mode='s3', bucket_name='cloud_results' # bucket_name should be set to the bucket on your S3
)
model = tl.fit(num_epochs=10)