Callbacks

For advanced model training experiments, the basic logic offered by the available TrainLoops might not be enough. Additional logic can be injected into the training procedure via callbacks, provided as a list to the callbacks parameter of the fit() function found in all TrainLoops.

Available Callbacks

AIToolbox by default already offers a wide selection of useful callbacks which can be used to augment the base training procedure. These out of the box callbacks, spanning several general categories, can be found in the aitoolbox.torchtrain.callbacks module.

An example of several basic callbacks used to infuse additional logic into the model training process:

from torch import nn, optim
from torch.utils.data import DataLoader

from aitoolbox.torchtrain.train_loop import *
from aitoolbox.torchtrain.callbacks.basic import EarlyStopping, TerminateOnNaN, AllPredictionsSame


model = CNNModel()  # TTModel based neural model
train_loader = DataLoader(...)
val_loader = DataLoader(...)
test_loader = DataLoader(...)

optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
criterion = nn.NLLLoss()

callbacks = [
    EarlyStopping(patience=3),
    TerminateOnNaN(),
    AllPredictionsSame(value=0.)
]

tl = TrainLoop(model,
               train_loader, val_loader, test_loader,
               optimizer, criterion)

model = tl.fit(num_epochs=10, callbacks=callbacks)

For a full working example showing the use of multiple callbacks of various types, check out this fully tracked training experiment example.

Implementing New Callbacks

However, when completely new functionality is desired which is not available out of the box in AIToolbox, the user can implement their own custom callbacks. These can then be used like any other callback to further extend the training loop.

AbstractCallback

A new callback is implemented as a class inheriting from the base AbstractCallback. All the user has to do is override the methods corresponding to the positions in the TrainLoop training process at which the new callback should be executed. If a callback method is left unimplemented, the no-op default from the parent AbstractCallback is kept and the callback has no effect on the TrainLoop at the corresponding position in the training process.

Callback execution is currently supported at several positions in the TrainLoop. A callback hooks into a position by overriding the corresponding AbstractCallback method, such as on_train_loop_registration(), on_train_begin(), on_epoch_begin(), on_batch_begin(), on_batch_end(), on_epoch_end() or on_train_end(), among others.
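
For instance, a minimal sketch of a callback that only needs to act at the end of each epoch overrides just that one method and inherits the no-op defaults for all other positions (the class name here is a made-up example):

from aitoolbox.torchtrain.callbacks.abstract import AbstractCallback


class EpochEndPrinter(AbstractCallback):
    def __init__(self):
        # The first argument is the human-readable callback name
        super().__init__('epoch end printer')

    def on_epoch_end(self):
        # The only overridden hook; every other position keeps
        # the no-op default inherited from AbstractCallback
        print('Epoch finished')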

train_loop_obj

The most useful and thus important aspect of every callback is its ability to communicate with and modify the encapsulating running TrainLoop. Every callback has a special attribute train_loop_obj which at the start of the TrainLoop training process gets assigned a reference to the encapsulating TrainLoop object. In AIToolbox this process is called TrainLoop registration and is done automatically under the hood by the TrainLoop calling register_train_loop_object().

Via train_loop_obj the callback thus has complete access to and control over every aspect of the TrainLoop. While potentially dangerous for inexperienced users, this extensive low-level control is especially welcome for advanced research use of AIToolbox. After the TrainLoop registration, the reference to the encapsulating TrainLoop can simply be accessed from any implemented callback method via self.train_loop_obj.

Custom Callback Example

An example of a newly developed callback and its use in the TrainLoop:

from torch import nn, optim
from torch.utils.data import DataLoader

from aitoolbox.torchtrain.train_loop import *
from aitoolbox.torchtrain.callbacks.abstract import AbstractCallback
from aitoolbox.torchtrain.callbacks.basic import EarlyStopping, TerminateOnNaN, AllPredictionsSame


class MyDemoTrainingReportCallback(AbstractCallback):
    def __init__(self):
        super().__init__('simple callback example')

    def on_train_begin(self):
        experiment_start_time = self.train_loop_obj.experiment_timestamp
        print(f'Starting the training! Experiment started at: {experiment_start_time}')

    def on_epoch_begin(self):
        current_epoch = self.train_loop_obj.epoch
        print(f'Starting new epoch num {current_epoch}')

    def on_epoch_end(self):
        val_predictions = self.train_loop_obj.predict_on_validation_set()
        print('Model predictions:')
        print(val_predictions)

    def on_train_end(self):
        print(f'End of training! Stopped at epoch {self.train_loop_obj.epoch}')

        test_predictions = self.train_loop_obj.predict_on_test_set()
        print('Model predictions:')
        print(test_predictions)


model = CNNModel()  # TTModel based neural model
train_loader = DataLoader(...)
val_loader = DataLoader(...)
test_loader = DataLoader(...)

optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
criterion = nn.NLLLoss()

callbacks = [
    MyDemoTrainingReportCallback(),
    EarlyStopping(patience=3),
    TerminateOnNaN(),
    AllPredictionsSame(value=0.)
]

tl = TrainLoop(model,
               train_loader, val_loader, test_loader,
               optimizer, criterion)

model = tl.fit(num_epochs=10, callbacks=callbacks)

AbstractExperimentCallback

In case the developed callback is aimed at experiment tracking, where information about the created experiment such as the project name, the experiment name and the path of the local experiment folder is needed, the AbstractExperimentCallback is also available. AbstractExperimentCallback has all the same properties as the basic AbstractCallback and is extended with the convenience method try_infer_experiment_details() which extracts the experiment details from the running TrainLoop and infuses the callback with this additional needed information.

For an example of try_infer_experiment_details() used in practice, check the implementation of aitoolbox.torchtrain.callbacks.performance_eval.ModelTrainHistoryPlot.on_train_loop_registration().
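
As a hedged sketch of this pattern (the import path, the constructor keyword arguments passed through to AbstractExperimentCallback and the try_infer_experiment_details() argument are assumptions here, so check the current API before relying on them), such an experiment tracking callback could look like this:

from aitoolbox.torchtrain.callbacks.abstract import AbstractExperimentCallback


class DemoExperimentInfoPrinter(AbstractExperimentCallback):
    def __init__(self, project_name=None, experiment_name=None):
        # Hypothetical example callback; project_name and experiment_name
        # are left as None so they can be inferred from the running TrainLoop
        # (assumed constructor signature)
        super().__init__('experiment info printer',
                         project_name=project_name, experiment_name=experiment_name)

    def on_train_loop_registration(self):
        # Fill in the experiment details from the encapsulating TrainLoop
        # (assumed keyword argument)
        self.try_infer_experiment_details(infer_cloud_details=False)

    def on_train_end(self):
        print(f'Finished experiment {self.experiment_name} '
              f'in project {self.project_name}')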

DDP Multi-Processing Callbacks

When callbacks are used inside the DistributedDataParallel TrainLoop (more about this can be found in Multi-GPU Training), by default they are executed in each of the running processes. This behaviour can be desired, however in certain situations the opposite is required and the callback should be executed only in one lead process.

When developing such a callback intended to be executed only in one of the spawned processes, the torchtrain callbacks framework enables this via the device_idx_execution parameter which is part of every callback inheriting from AbstractCallback. It tells the TrainLoop engine in which process, identified by its corresponding GPU device index, the callback should be executed. For example, if the callback has device_idx_execution set to 0, the callback will only be executed as part of the process running on the first GPU. When device_idx_execution is set to None, which is the default, the callback is executed inside every running process.

A simple example callback that gets executed only in the process running on the first GPU:

from aitoolbox.torchtrain.callbacks.abstract import AbstractCallback


class DemoFirstGPUCallback(AbstractCallback):
    def __init__(self):
        super().__init__('first GPU callback example',
                         device_idx_execution=0)

    def on_train_begin(self):
        # Executed only once, in the lead process running on GPU 0
        print('Starting training, reporting from the process on GPU 0')