Bottom line: Calling the torch.set_num_threads() function in a PyTorch program that has a Transformer module can significantly change the behavior of the program (in my case, for the better).
I was experimenting with a PyTorch program that uses a TransformerEncoder to do anomaly detection. See https://jamesmccaffreyblog.com/2022/07/25/testing-a-transformer-based-autoencoder-anomaly-detection-system/.

This program mysteriously stopped working one day.
During training I saw this:
Starting training
epoch =    0   loss = 1658.0013
epoch =   10   loss = 945.3817
epoch =   20   loss = 467.7127
epoch =   30   loss = 277.3138
epoch =   40   loss = 202.2976
epoch =   50   loss = 160.5351
epoch =   60   loss = 130.3890
epoch =   70   loss = 108.6009
epoch =   80   loss = 92.6008
epoch =   90   loss = 81.1246
Done
The loss value steadily decreased which indicated that the network containing the TransformerEncoder was learning. Good.
Then, one morning, the exact same program on the exact same machine started showing:
Starting training
epoch =    0   loss = 1658.0013
epoch =   10   loss = 945.3808
epoch =   20   loss = 955.5437
epoch =   30   loss = 950.6722
epoch =   40   loss = 949.4239
epoch =   50   loss = 956.1644
epoch =   60   loss = 954.2082
epoch =   70   loss = 951.4501
epoch =   80   loss = 945.4361
epoch =   90   loss = 954.1126
Done
The loss value immediately got stuck and so the network was not learning. This had me baffled. Somewhat unfortunately, all my machines belong to my company and are joined to the company network domain, which means that they are constantly being updated. I assumed that one of the updates had changed something.
A couple of days later, I ran into my work pal Ricky L who is an expert with transformer architecture. I described the weirdness in my system to him. He said he wasn’t surprised and that one thing for me to try was to set the number of threads explicitly with the statement torch.set_num_threads(1).
I looked up set_num_threads() in the PyTorch documentation and found exactly three sentences:
TORCH.SET_NUM_THREADS
torch.set_num_threads(int)
Sets the number of threads used for intraop parallelism on CPU.
That wasn't too helpful, so I just added a global call at the top of my program:
# uci_trans_anomaly.py
# Transformer based reconstruction error for UCI Digits
# PyTorch 1.10.0-CPU Anaconda3-2020.02 Python 3.7.6
# Windows 10/11
import numpy as np
import torch as T
device = T.device('cpu')
T.set_num_threads(1) # I added this statement
# -----------------------------------------------------------
class Transformer_Net(T.nn.Module):
. . . etc
And voila! The program was working correctly again.
Note: You can check how many threads your machine is using by default with the torch.get_num_threads() function.
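For example, a quick interactive check (assuming a CPU build of PyTorch) might look like this:

```python
import torch as T

# See how many intraop threads PyTorch chose by default on this machine
print(T.get_num_threads())

# Force single-threaded intraop execution for reproducible behavior
T.set_num_threads(1)
print(T.get_num_threads())  # now reports 1
```

Note that set_num_threads() should be called early, before any heavy computation runs, which is why I placed the call at the top of my program.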
I still don’t know the exact cause of the change in behavior of my program. But the moral of the story is that calling set_num_threads() in programs that use a Transformer module might be a good idea.

A ventriloquist and his/her dummy have two output streams but only one underlying thread of execution. Three old albums from the 1960s. Left: Geraldine and Ricky. Center: Happy Harry and Uncle Weldon. Right: Chris Kirby and Terry. Do not click on image to enlarge unless you’re prepared for several years of nightmares.
Maybe you ran into a parallelism problem, where the work is processed not by a single core or thread but by several. Not nice if this happens automatically because of an update. It is interesting that the network then collapses; it suggests that this network, in combination with the training method, is not very robust.
With one thread (which, as I have seen in Task Manager, can still migrate across multiple cores), the sequence of operations is always the same: one CPU processes one task at a time. When multiple threads work in parallel, the order of operations changes, and with it the result.
You would always expect the same result, and that holds for one thread. With several threads, however, there is also the problem of floating-point rounding errors. A single thread suffers from them too, but always in the same way, so the results remain reproducible.
Whew, hopefully you’re still with me because I’m struggling through it myself right now and hoping that practice and intuition agree so far.
A practical example: the delta value of a weight is formed as output gradient * input neuron. Here we multiply, e.g., 1.0f * 99.999999f = 100.0f in float32, but the true result should be slightly less than 100.
In addition, the delta values are usually accumulated over the batch. Here, for example, 0.1f + 0.2f = 0.30000001f.
These small differences can therefore explain the different results. It must be remembered that a computer with multiple cores working in parallel produces an effectively random order of operations: the start time of each core is different, and so is the processing time for each example.
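The order-of-summation effect described above is easy to demonstrate in plain Python (these are 64-bit doubles, but the same idea applies to float32): adding the same numbers in a different order can give slightly different results, because floating-point addition is not associative.

```python
vals = [0.1, 0.2, 0.3]

# Sum left-to-right, as a single thread would
left = (vals[0] + vals[1]) + vals[2]

# Sum with a different grouping, as parallel partial sums might
right = vals[0] + (vals[1] + vals[2])

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

When multiple threads each compute a partial sum and those partial sums are combined, the grouping depends on thread scheduling, so results can vary from run to run.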
The phenomenon can also be observed with double precision, but it is probably less noticeable there. One solution could be to store the delta values for the whole batch and then sum them in a single thread, but that seems inefficient.
I am not completely convinced of my explanation, but I hope it is approximately correct.