The goal of one of my ongoing projects is to find anomalies in medical data, where the data is ordered in some way. I imagine each line of data represents a hospital patient, and each value on a line is an hourly reading of some sort, such as blood pressure or heart rate.
In my first, early experiments, I created some synthetic pseudo-medical data that looks like:
# each row is a patient, 12 hour readings in columns
#
0.2195, 0.2861, 0.3411, . . . 0.3116
0.2272, 0.3702, 0.1284, . . . 0.4122
0.0473, 0.2560, 0.1573, . . . 0.3471
. . .
There are n rows (where n is usually 100 or 200) and 12 columns.
One of my first explorations with this form of data used reconstruction error from a PyTorch neural autoencoder with a TransformerEncoder layer inserted. See jamesmccaffrey.wordpress.com/2024/04/24/pytorch-transformerencoder-reconstruction-error-anomaly-detection-for-ordered-data/. That experiment did not use an explicit embedding layer.
A later experiment used reconstruction error from a PyTorch neural autoencoder with a TransformerEncoder layer inserted, and with an explicit numeric embedding layer. See jamesmccaffrey.wordpress.com/2024/05/14/pytorch-transformerencoder-reconstruction-error-anomaly-detection-for-sequential-data-using-a-custom-numeric-embedding-layer/.
In this blog post, I describe the dataset preparation for an experiment that is like the previous experiment (autoencoder, TransformerEncoder layer, numeric embedding layer) but applied to synthetic medical data that has a different, more realistic structure. Each row of my previous synthetic datasets was in sync, in the sense that the columns/readings of every row were generated from the same hidden base readings, starting at the same point. This means each synthetic patient was admitted in the exact same state.
To make the synthetic data more realistic, I start each row of data at a different offset to simulate patients being admitted with the same underlying condition, but in a different starting state. I did this by generating 24 readings for each patient, and then selecting 12 consecutive readings from a random start reading.
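The windowing idea can be sketched without any noise. This is a minimal illustration, not the generator itself: the `window` helper is a hypothetical name, and the 12-value cycle is the pattern that repeats twice in the 24 base readings of the program at the end of this post.

```python
import numpy as np

# One 12-hour cycle; the 24 base readings are this cycle repeated twice.
cycle = np.array([0.2, 0.2, 0.3, 0.5, 0.4, 0.8,
                  0.7, 0.7, 0.5, 0.3, 0.2, 0.3], dtype=np.float64)
base = np.tile(cycle, 2)  # 24 readings

def window(base, start, dim=12):
    # hypothetical helper: dim consecutive readings beginning at start
    return base[start:start + dim]

a = window(base, 0)  # patient admitted at the start of the cycle
b = window(base, 5)  # patient admitted five hours into the cycle

print(a)
print(b)
```

Because the base pattern repeats, a window that starts at offset 5 is the same cycle shifted left by five positions, so both patients share one underlying shape but start in different states.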

Graph of the first four synthetic patient data items. Finding anomalies will be very challenging.
This idea is a bit difficult to explain in words and the code below probably explains better. The resulting synthetic medical data looks like:
0.9377, 0.8432, 0.8389, . . . 0.3571
0.1284, 0.0349, 0.1081, . . . 0.5560
0.3149, 0.3034, 0.1424, . . . 0.4546
0.4727, 0.7438, 0.6748, . . . 0.5281
. . .
Each data line/patient has the same underlying structure, but a different starting point. This means finding anomalies will be a very difficult task, but one that might be achieved using a neural system that has a Transformer component, which has an Attention component to take order into account. Maybe. That will be a topic for another day.

I sometimes think of anomalous medical data as mutated normal data. Three of my favorite old science fiction movies feature mutants who act as helpers to a race of aliens.
Left: In “Invaders from Mars” (1953), Martians are small octopus-like beings, but they have very large mutant helpers.
Center: In “Battle in Outer Space” (1959), aliens from the planet Natal attempt to conquer Earth. We never see the Natalians, but they use small, vaguely insectoid mutants as henchmen and helpers.
Right: In “This Island Earth” (1955), aliens from the planet Metaluna come to Earth and kidnap two scientists to get help developing weapons to defeat their alien enemies, the Zagons. The scientists escape back to Earth after dealing with a Metaluna mutant guard.
Program code:
# make_medical_data_2.py
# more complex/realistic than make_medical_data.py

import numpy as np

# 24 base readings
base = np.array([0.2000, 0.2000, 0.3000, 0.5000,
                 0.4000, 0.8000, 0.7000, 0.7000,
                 0.5000, 0.3000, 0.2000, 0.3000,
                 0.2000, 0.2000, 0.3000, 0.5000,
                 0.4000, 0.8000, 0.7000, 0.7000,
                 0.5000, 0.3000, 0.2000, 0.3000],
                dtype=np.float64)

# print("\nbaseline data: ")
# print(base)

rnd = np.random.RandomState(0)
n = 100  # number of patients
# rnd = np.random.RandomState(1)
# n = 200  # number of patients

dim = 12  # 12 hours
for i in range(n):
  start_i = rnd.randint(0, 12)  # 0 to 11 inclusive
  # print(str(i) + " ", end="")
  for j in range(dim):
    # print(start_i + j); input()
    x = base[start_i + j]
    xx = x + rnd.uniform(low = -0.20, high = 0.20)
    print("%0.4f" % xx, end="")
    if j < dim-1: print(", ", end="")
  print("")

print("\nDone ")
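For downstream experiments it can be handier to get the data as a NumPy array than as printed text. The following is a sketch of one way to wrap the same logic in a function. The names `make_data`, `noise`, and `seed` are my own assumptions and are not part of the original program. The random draws are made in the same order as in the program above, so the same seed should give the same readings.

```python
import io
import numpy as np

# the 24 base readings: one 12-hour cycle repeated twice
BASE = np.tile(np.array([0.2, 0.2, 0.3, 0.5, 0.4, 0.8,
                         0.7, 0.7, 0.5, 0.3, 0.2, 0.3],
                        dtype=np.float64), 2)

def make_data(n, dim=12, noise=0.20, seed=0):
  # hypothetical wrapper: returns an (n, dim) array of readings
  rnd = np.random.RandomState(seed)
  data = np.zeros((n, dim), dtype=np.float64)
  for i in range(n):
    start_i = rnd.randint(0, 12)  # 0 to 11 inclusive
    for j in range(dim):
      data[i, j] = BASE[start_i + j] + \
        rnd.uniform(low=-noise, high=noise)
  return data

data = make_data(100)
print(data.shape)

# write in the same comma-delimited text format as the program;
# an in-memory buffer is used here, but a file name also works
buf = io.StringIO()
np.savetxt(buf, data, fmt="%0.4f", delimiter=", ")
lines = buf.getvalue().splitlines()
print(lines[0])
```

Passing a file name instead of the buffer to np.savetxt() would save the data to disk in a form that np.loadtxt() with delimiter="," can read back.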