In non-demo scenarios, training a neural network can take hours, days, weeks, or even longer. It’s not uncommon for machines to crash, so you should always save checkpoint information during training so that if your training machine crashes or hangs, you can recover without having to start from the beginning of training.
In pseudo-code, to save a state checkpoint every 100 training epochs:

  set random number generators
  create network
  create training optimizer
  for epoch in range(0, max_epochs):
    reset random seed
    for batch in range(0, n_batches):
      use batch inputs to compute predicteds
      compare predicteds to targets to get loss
      use loss to update network weights
    if epoch mod 100 == 0:
      print loss to monitor progress
      get RNGs, network, optimizer states
      save states to file
  print("training complete")
In code, one of zillions of possibilities is:
  if epoch % 100 == 0:
    print("epoch = %4d  loss = %0.4f" % (epoch, epoch_loss))
    # save training checkpoint
    dt = time.strftime("%Y_%m_%d-%H_%M_%S")
    fn = ".\\Log\\" + str(dt) + str("-") + str(epoch) + \
      "_checkpoint.pt"
    info_dict = { 'epoch' : epoch,
      'torch_random_state' : T.random.get_rng_state(),
      'numpy_random_state' : np.random.get_state(),
      'net_state' : net.state_dict(),
      'optimizer_state' : optimizer.state_dict() }
    T.save(info_dict, fn)
The current date and time are used to create a year-month-day-hour-minute-second-epoch filename that looks like "2020_11_16-12_36_46-500_checkpoint.pt". It's usually a good idea to fetch and save both the PyTorch and NumPy random number generator states because they may be consumed in several different places during training, and restoring them is required for exact reproducibility. The net.state_dict() method gets all the current weights and biases, and the optimizer.state_dict() method gets the optimizer state.
Unfortunately, there is no easy way to get the state of a PyTorch DataLoader object, so when you recover from a crash, your subsequent training will be slightly different from what it would have been if the crash hadn't occurred. This usually doesn't matter. You can avoid the issue and get fully reproducible results by resetting the PyTorch random number generator seed at the beginning of each epoch:
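The timestamped-filename logic can be exercised in isolation. Here is a minimal sketch using only the standard library; the make_checkpoint_name helper and its default log_dir are hypothetical conveniences, not part of the original program:

  import re
  import time

  def make_checkpoint_name(epoch, log_dir=".\\Log\\"):
      # build a year_month_day-hour_minute_second-epoch name,
      # e.g. 2020_11_16-12_36_46-500_checkpoint.pt
      dt = time.strftime("%Y_%m_%d-%H_%M_%S")
      return log_dir + dt + "-" + str(epoch) + "_checkpoint.pt"

  fn = make_checkpoint_name(500)
  # the generated name matches the expected date-time-epoch pattern
  pattern = r"\d{4}_\d{2}_\d{2}-\d{2}_\d{2}_\d{2}-500_checkpoint\.pt$"
  assert re.search(pattern, fn) is not None

Because the name embeds the epoch as well as the time, successive checkpoints never overwrite each other.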
  net.train()  # or net = net.train()
  for epoch in range(0, max_epochs):
    T.manual_seed(1 + epoch)  # for recovery reproducibility
    epoch_loss = 0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch['predictors']  # inputs
      Y = batch['targets']     # correct class/label
      . . .
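The effect of re-seeding at the top of each epoch can be demonstrated with the standard library random module as a stand-in for the PyTorch generator. The 1 + epoch seed scheme is from the loop above; the epoch_batch_order helper and the item count are illustrative:

  import random

  def epoch_batch_order(epoch, n_items=8):
      # mimic a DataLoader shuffle: the seed depends only on the
      # epoch, so a restarted run reproduces the same order
      # for that epoch
      random.seed(1 + epoch)
      order = list(range(n_items))
      random.shuffle(order)
      return order

  # the same epoch always yields the same shuffled order,
  # whether or not training was interrupted and restarted
  assert epoch_batch_order(3) == epoch_batch_order(3)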
If your training machine crashes during training, you can recover by resetting the RNGs (if necessary), the network, the optimizer, and the training epoch with code like this:
  print("\nBegin recover failed training \n")

  fn = ".\\Log\\2020_11_16-12_36_46-400_checkpoint.pt"
  chkpt = T.load(fn)  # get saved states dictionary

  # reset RNGs if needed
  np.random.set_state(chkpt['numpy_random_state'])
  T.random.set_rng_state(chkpt['torch_random_state'])
  . . .
  net = Net().to(device)  # reset network
  net.load_state_dict(chkpt['net_state'])
  . . .
  loss_func = T.nn.CrossEntropyLoss()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
  optimizer.load_state_dict(chkpt['optimizer_state'])
  . . .
  epoch_saved = chkpt['epoch'] + 1
  for epoch in range(epoch_saved, max_epochs):
    T.manual_seed(1 + epoch)  # for recovery reproducibility
    epoch_loss = 0  # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      . . .
Essentially, your recovery code is exactly like normal training code except that you restore the RNGs, the network, the optimizer, and the starting epoch from the saved states.
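The save/restore round trip can be sketched at the language level using Python's pickle in place of T.save() and T.load(), and plain dictionaries in place of the real state_dict() objects. All names and values here are illustrative stand-ins:

  import os
  import pickle
  import tempfile

  # stand-ins for net.state_dict() and optimizer.state_dict()
  net_state = {'fc1.weight': [0.1, -0.2], 'fc1.bias': [0.0]}
  opt_state = {'lr': 0.01, 'momentum': 0.9}

  info_dict = {'epoch': 400,
               'net_state': net_state,
               'optimizer_state': opt_state}

  fn = os.path.join(tempfile.gettempdir(), "demo_checkpoint.pt")
  with open(fn, "wb") as f:        # analogous to T.save(info_dict, fn)
      pickle.dump(info_dict, f)

  with open(fn, "rb") as f:        # analogous to chkpt = T.load(fn)
      chkpt = pickle.load(f)

  epoch_saved = chkpt['epoch'] + 1  # resume at the next epoch
  assert epoch_saved == 401
  assert chkpt['net_state'] == net_state

The key idea is the same either way: everything needed to resume lives in one dictionary, so one save call and one load call capture the full training state.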
In the two screenshots above, you can see that I saved states every 100 epochs, then used a second program to load the checkpoint saved after epoch 400 and restart training at epoch 401, simulating what would happen if training had crashed after epoch 400. Notice that the results are identical.
Disaster can strike unexpectedly when training a neural network. Disaster can also strike whenever you pose for a photograph with animals. Duck, goat, stingray.

