Bottom line: Calling the torch.set_num_threads() function in a PyTorch program that has a Transformer module can significantly change the behavior of the program (in my case, for the better).
I was experimenting with a PyTorch program that uses a TransformerEncoder to do anomaly detection. See https://jamesmccaffreyblog.com/2022/07/25/testing-a-transformer-based-autoencoder-anomaly-detection-system/.

This program mysteriously stopped working one day.
During training I saw this:
Starting training
epoch =    0   loss = 1658.0013
epoch =   10   loss = 945.3817
epoch =   20   loss = 467.7127
epoch =   30   loss = 277.3138
epoch =   40   loss = 202.2976
epoch =   50   loss = 160.5351
epoch =   60   loss = 130.3890
epoch =   70   loss = 108.6009
epoch =   80   loss = 92.6008
epoch =   90   loss = 81.1246
Done
The loss value steadily decreased which indicated that the network containing the TransformerEncoder was learning. Good.
Then, one morning, the exact same program on the exact same machine started showing:
Starting training
epoch =    0   loss = 1658.0013
epoch =   10   loss = 945.3808
epoch =   20   loss = 955.5437
epoch =   30   loss = 950.6722
epoch =   40   loss = 949.4239
epoch =   50   loss = 956.1644
epoch =   60   loss = 954.2082
epoch =   70   loss = 951.4501
epoch =   80   loss = 945.4361
epoch =   90   loss = 954.1126
Done
The loss value immediately got stuck and so the network was not learning. This had me baffled. Somewhat unfortunately, all my machines belong to my company and are joined to the company network domain, which means that they are constantly being updated. I assumed that one of the updates had changed something.
A couple of days later, I ran into my work pal Ricky L who is an expert with transformer architecture. I described the weirdness in my system to him. He said he wasn’t surprised and that one thing for me to try was to set the number of threads explicitly with the statement torch.set_num_threads(1).
I looked up set_num_threads() in the PyTorch documentation and found exactly three sentences:
TORCH.SET_NUM_THREADS
torch.set_num_threads(int)
Sets the number of threads used for intraop parallelism on CPU.
That wasn't too helpful, so I just added a global call at the top of my program:
# uci_trans_anomaly.py
# Transformer based reconstruction error for UCI Digits
# PyTorch 1.10.0-CPU Anaconda3-2020.02 Python 3.7.6
# Windows 10/11
import numpy as np
import torch as T
device = T.device('cpu')
T.set_num_threads(1) # I added this statement
# -----------------------------------------------------------
class Transformer_Net(T.nn.Module):
. . . etc
And voila! The program was working correctly again.
Note: You can check how many threads your machine is using by default with the torch.get_num_threads() function.
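For example, a quick interactive check (assuming a CPU build of PyTorch) might look like this:

```python
import torch as T

# See how many intraop threads PyTorch chose by default on this machine
print(T.get_num_threads())

# Force single-threaded intraop execution for reproducible behavior
T.set_num_threads(1)
print(T.get_num_threads())  # now reports 1
```

Note that set_num_threads() should be called early, before any heavy computation runs, which is why I placed the call at the top of my program.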
I still don’t know the exact cause of the change in behavior of my program. But the moral of the story is that calling set_num_threads() in programs that use a Transformer module might be a good idea.

A ventriloquist and his/her dummy have two output streams but only one underlying thread of execution. Three old albums from the 1960s. Left: Geraldine and Ricky. Center: Happy Harry and Uncle Weldon. Right: Chris Kirby and Terry. Do not click on image to enlarge unless you’re prepared for several years of nightmares.
Maybe you ran into a parallelism problem, where the work is processed not by a single core or thread but by several. Not nice if this happens automatically because of an update. It is interesting that the network then collapses; it suggests that this network, in combination with the training method, is not very robust.
With one thread (which, as I have seen in Task Manager, can still migrate across multiple cores), the sequence of operations is always the same: one CPU processes one task at a time. When multiple threads work in parallel, the order of operations changes, and with it the result.
You would always expect the same result, and that holds for one thread. With several threads, however, there is also the problem of floating-point rounding errors. A single thread suffers from them too, but always in the same way, so the results remain reproducible.
Whew, hopefully you’re still with me because I’m struggling through it myself right now and hoping that practice and intuition agree so far.
A practical example: the delta value of a weight is formed as output gradient * input neuron. Here we multiply, e.g., 1.0f * 99.999999f = 100.0f in float32, but the true result should be slightly less than 100.
In addition, the delta values are usually accumulated over the batch. Here, for example, 0.1f + 0.2f = 0.30000001f.
These small differences can therefore explain the different results. It must be remembered that a computer with multiple cores working in parallel produces an effectively random order of operations: the start time of each core is different, and so is the processing time for each example.
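The order-of-summation effect described above is easy to demonstrate in plain Python (these are 64-bit doubles, but the same idea applies to float32): adding the same numbers in a different order can give slightly different results, because floating-point addition is not associative.

```python
vals = [0.1, 0.2, 0.3]

# Sum left-to-right, as a single thread would
left = (vals[0] + vals[1]) + vals[2]

# Sum with a different grouping, as parallel partial sums might
right = vals[0] + (vals[1] + vals[2])

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

When multiple threads each compute a partial sum and those partial sums are combined, the grouping depends on thread scheduling, so results can vary from run to run.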
The phenomenon can also be observed with double precision, but it is probably less noticeable there. One solution could be to store the delta values for the whole batch and then sum them in a single thread, but that seems inefficient.
I am not completely convinced of my explanation, but I hope it is approximately correct.