Creating Some Semi-Realistic Synthetic Medical Data

The goal of one of my ongoing projects is to find anomalies in medical data, where the data is ordered in some way. I imagine each line of data represents a hospital patient, and each value on a line is an hourly reading of some sort, such as blood pressure or heart rate.

In my early experiments, I created some synthetic pseudo-medical data that looks like:

# each row is a patient, 12 hour readings in columns
#
0.2195, 0.2861, 0.3411, . . . 0.3116
0.2272, 0.3702, 0.1284, . . . 0.4122
0.0473, 0.2560, 0.1573, . . . 0.3471
. . .

There are n rows (where n is usually 100 or 200) and 12 columns.
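
If data in this format is saved to a text file, it could be read into a NumPy matrix along these lines. This is a minimal sketch, not code from my experiments: the in-memory sample stands in for a real file, and it holds only the first three columns of the three rows shown above because the remaining values are elided.

```python
import io
import numpy as np

# stand-in for a data file; only the first three columns
# of the rows shown above (the other values are elided)
sample = io.StringIO(
    "# each row is a patient\n"
    "#\n"
    "0.2195, 0.2861, 0.3411\n"
    "0.2272, 0.3702, 0.1284\n"
    "0.0473, 0.2560, 0.1573\n"
)

# lines that start with "#" are skipped as comments
data = np.loadtxt(sample, delimiter=",", comments="#",
                  dtype=np.float64)
print(data.shape)  # (3, 3)
```

With a real file, the StringIO object would be replaced by the file name, and the resulting shape would be (n, 12).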

One of my first explorations with this form of data used reconstruction error from a PyTorch neural autoencoder with a TransformerEncoder layer inserted. See jamesmccaffrey.wordpress.com/2024/04/24/pytorch-transformerencoder-reconstruction-error-anomaly-detection-for-ordered-data/. That experiment did not use an explicit embedding layer.

A later experiment used reconstruction error from a PyTorch neural autoencoder with a TransformerEncoder layer inserted, and with an explicit numeric embedding layer. See jamesmccaffrey.wordpress.com/2024/05/14/pytorch-transformerencoder-reconstruction-error-anomaly-detection-for-sequential-data-using-a-custom-numeric-embedding-layer/.

In this blog post, I describe the dataset preparation for an experiment that is like the previous experiment (autoencoder, TransformerEncoder layer, numeric embedding layer) but applied to synthetic medical data that has a different, more realistic structure. Each row of my previous synthetic datasets was in sync, in the sense that the columns/readings of every row were generated from the same hidden base readings, starting at the same position. This means each synthetic patient was admitted in the exact same state.

To make the synthetic data more realistic, I start each row of data at a different offset to simulate patients being admitted with the same underlying condition, but in a different starting state. I did this by defining 24 hidden base readings, and then for each patient selecting 12 consecutive readings beginning at a random start position (0 to 11) and adding a small amount of uniform random noise (between -0.20 and +0.20) to each reading.
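
The offset idea for a single patient can be sketched as a small function. This is an illustrative sketch only: the make_patient() function and its parameter names are hypothetical and not part of the demo program, and the output is not guaranteed to match the program's output line for line.

```python
import numpy as np

# one 12-hour cycle of hidden base readings, repeated twice
cycle = [0.2, 0.2, 0.3, 0.5, 0.4, 0.8,
         0.7, 0.7, 0.5, 0.3, 0.2, 0.3]
base = np.array(cycle * 2, dtype=np.float64)  # 24 readings

def make_patient(base, start, rnd, dim=12, noise=0.20):
  # hypothetical helper: take dim consecutive base readings
  # beginning at start, then add uniform noise to each one
  window = base[start:start + dim]
  return window + rnd.uniform(low=-noise, high=noise, size=dim)

rnd = np.random.RandomState(0)
start = rnd.randint(0, 12)  # 0 to 11 inclusive
row = make_patient(base, start, rnd)
print(start)
print(np.round(row, 4))
```

Because the start position is at most 11 and each patient uses 12 readings, the largest index used is 22, so the 24 base readings are always enough.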

Graph of the first four synthetic patient data items. Finding anomalies will be very challenging.

This idea is a bit difficult to explain in words and the code below probably explains better. The resulting synthetic medical data looks like:

0.9377, 0.8432, 0.8389, . . . 0.3571
0.1284, 0.0349, 0.1081, . . . 0.5560
0.3149, 0.3034, 0.1424, . . . 0.4546
0.4727, 0.7438, 0.6748, . . . 0.5281
. . .

Each data line/patient has the same underlying structure, but a different starting point. This means finding anomalies will be a very difficult task, but one that might be achieved using a neural system that has a Transformer component, which has an Attention component to take order into account. Maybe. That will be a topic for another day.

I sometimes think of anomalous medical data as mutated normal data. Three of my favorite old science fiction movies feature mutants who act as helpers to a race of aliens.

Left: In “Invaders from Mars” (1953), Martians are small octopus-like beings, but they have very large mutant helpers.

Center: In “Battle in Outer Space” (1959), aliens from the planet Natal attempt to conquer Earth. We never see the Natalians, but they use small, vaguely insectoid mutants as henchmen and helpers.

Right: In “This Island Earth” (1955), aliens from the planet Metaluna come to the Earth and kidnap two scientists to get help to develop weapons to defeat their alien enemies, the Zagons. The scientists escape back to Earth after dealing with a Metaluna mutant guard.

Program code:

# make_medical_data_2.py

# more complex/realistic than make_medical_data.py

import numpy as np

# 24 base readings
base = np.array([0.2000, 0.2000, 0.3000, 0.5000,
                 0.4000, 0.8000, 0.7000, 0.7000, 
                 0.5000, 0.3000, 0.2000, 0.3000,
                 0.2000, 0.2000, 0.3000, 0.5000,
                 0.4000, 0.8000, 0.7000, 0.7000, 
                 0.5000, 0.3000, 0.2000, 0.3000],
                 dtype=np.float64)

# print("\nbaseline data: ")
# print(base)

rnd = np.random.RandomState(0)
n = 100    # number of patients

# rnd = np.random.RandomState(1)
# n = 200    # number of patients

dim = 12  # 12 hours
for i in range(n):
  start_i = rnd.randint(0, 12)  # 0 to 11 inclusive
  # print(str(i) + " ", end="")
  for j in range(dim):
    # print(start_i + j); input()
    x = base[start_i + j]
    xx = x + rnd.uniform(low = -0.20, high = 0.20)
    print("%0.4f" % xx, end="")
    if j < dim-1: print(", ", end="")
  print("")

print("\nDone ")