Deep neural systems based on Transformer Architecture (TA, also called multi-headed attention models) have revolutionized natural language processing (NLP). TA systems were designed to deal with sequence-to-sequence problems, such as translating English text to German text. TA systems can also handle sequence-to-value problems, such as sentiment analysis.
I came across an interesting example in the Keras library documentation that used Transformer Architecture to perform time series classification. This is a sequence-to-value problem where the sequence data is numeric rather than word-tokens in a sentence.
Specifically, the example program created a binary classifier for the Ford time series data. The Ford A dataset has 3601 training items and 1320 test items. Each data item has 500 time series values between about -5.0 and +5.0 that represent a measurement of engine noise. Each of the 500 measurement values was captured at evenly spaced intervals (perhaps every 10 milliseconds). Each time series item is classified as -1 (no engine symptom) or +1 (engine symptom).
Note: I tracked down the source research paper for the Ford time series data but I don’t remember the details. The important idea is that there is numeric time series data and each series has a class label to predict. This is not at all the same as a time series regression problem where each time series is unlabeled and the goal is to predict the next numeric value in the series.
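Each line of the raw Ford A .tsv files holds the class label in column 0 followed by the 500 tab-separated values. A minimal sketch of the loading-and-relabeling idea, using a tiny hypothetical two-item snippet with just three values per item instead of 500:

```python
import io
import numpy as np

# hypothetical two-item snippet in the Ford A tab-separated format:
# label in column 0 (-1 or +1), then the time series values
raw = "-1\t0.79\t-1.20\t0.33\n+1\t-0.45\t2.10\t-0.88\n"
xy = np.loadtxt(io.StringIO(raw), delimiter="\t")
y = xy[:, 0]          # labels in column 0: [-1.  1.]
y[y == -1] = 0        # convert (-1,+1) to (0,1) for Keras
x = xy[:, 1:]         # remaining columns are the features
print(y.astype(int))  # [0 1]
print(x.shape)        # (2, 3)
```

The real files are fetched by URL in the demo; the snippet above only illustrates the format and the label conversion.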

Graphs of the first class -1 item and the first class +1 item in the training data. The demo program converts -1 labels to 0 labels.
Whenever I find an interesting code example that I want to explore, my first step is to refactor the example. This forces me to examine every line of code. So that’s what I did.

I only ran the demo for 10 epochs, which took about 6,000 seconds = 100 minutes = an hour and 40 minutes. Running longer would improve the accuracy on the test data to about 95%.

The model summary. TA systems are not simple.
As expected, the demo code is extremely complicated. Relative to the number of lines of code, Transformer Architecture systems are by far the most complex software systems I work with.
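The core computation inside each encoder block's MultiHeadAttention layer is scaled dot-product self-attention. A minimal single-head numpy sketch (with hypothetical random projection weights, not the actual Keras layer) shows the essential idea:

```python
import numpy as np

def softmax(z):
  e = np.exp(z - np.max(z, axis=-1, keepdims=True))
  return e / np.sum(e, axis=-1, keepdims=True)

# scaled dot-product self-attention for one head (a sketch, not the
# full Keras MultiHeadAttention layer): seq_len=4, key_dim=3
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 3))   # (seq_len, key_dim) input
Wq = rng.standard_normal((3, 3))  # hypothetical learned projections
Wk = rng.standard_normal((3, 3))
Wv = rng.standard_normal((3, 3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = (q @ k.T) / np.sqrt(3.0)  # scaled dot products
weights = softmax(scores)          # each row sums to 1
out = weights @ v                  # (4, 3) attended output
print(out.shape)                   # (4, 3)
```

The Keras layer adds multiple heads, output projection, masking and dropout on top of this, which is part of why the full systems are so complex.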
I fiddled with the demo TA example for several hours. How, or even “if”, this exploration will eventually pay off is not clear. But that’s what doing research is all about. If I get some time, my next step will be to refactor the Keras code to PyTorch.

Three covers for “A Princess of Mars”, by Edgar Rice Burroughs. The book is one of the most influential in the history of science fiction. The story first appeared in serialized form in “All-Story Magazine” in 1912, and was compiled into book form in 1917. Over the years, different artists have produced cover illustrations in similar but clearly distinct styles. Left: By artist Frank Schoonover (1917). Center: By Robert Abbett (1963). Right: By Gino D’Achille (1973).
Code below. Long.
# ford_tsc_transformer.py
# time series classification using a Keras
# Transformer (Multiheaded Attention)
# from https://keras.io/examples/timeseries/
# timeseries_transformer_classification/
# Keras 2.6.0 in TensorFlow 2.6.0
# Anaconda3-2020.02 Python 3.7.6 Windows 10
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2' # suppress CPU warn
import numpy as np
import tensorflow as tf
from tensorflow import keras as K
import matplotlib.pyplot as plt
# -----------------------------------------------------------
# http://www.j-wichard.de/publications/FordPaper.pdf
# The Ford data has 3601 training items and 1320 test items.
# Each item has 500 input values, roughly between -5.0
# and +5.0. Each item is class -1 (no symptom) or +1.
def load_xy(fn):
  xy = np.loadtxt(fn, delimiter="\t")
  y = xy[:, 0]    # labels in column 0 are (-1,+1)
  y[y == -1] = 0  # convert labels from (-1,+1) to (0,1)
  x = xy[:, 1:]   # 500 features
  return x, y.astype(int)
# -----------------------------------------------------------
def trans_encoder(inputs, head_size, n_heads, ff_dim,
    drop=0.0):
  # MultiHeadAttention with pre-layer normalization
  x = K.layers.LayerNormalization(epsilon=1e-6)(inputs)
  x = K.layers.MultiHeadAttention(
    key_dim=head_size, num_heads=n_heads, dropout=drop)(x, x)
  x = K.layers.Dropout(drop)(x)
  res = x + inputs  # residual connection
  # feed-forward part
  x = K.layers.LayerNormalization(epsilon=1e-6)(res)
  x = K.layers.Conv1D(filters=ff_dim, kernel_size=1,
    activation="relu")(x)
  x = K.layers.Dropout(drop)(x)
  x = K.layers.Conv1D(filters=inputs.shape[-1],
    kernel_size=1)(x)
  return x + res  # second residual connection
# -----------------------------------------------------------
def create_model(input_shape, head_size, n_heads, ff_dim,
    n_trans_blocks, mlp_units, drop=0.0, mlp_drop=0.0):
  inpts = K.Input(shape=input_shape)
  x = inpts
  for _ in range(n_trans_blocks):
    x = trans_encoder(x, head_size, n_heads, ff_dim, drop)
  x = K.layers.\
    GlobalAveragePooling1D(data_format="channels_first")(x)
  for dim in mlp_units:
    x = K.layers.Dense(dim, activation="relu")(x)
    x = K.layers.Dropout(mlp_drop)(x)
  oupts = K.layers.Dense(2, activation="softmax")(x)
  return K.Model(inpts, oupts)
# -----------------------------------------------------------
def main():
  # 0. get ready
  print("\nBegin Transformer classification demo ")
  np.random.seed(1)
  tf.random.set_seed(1)
  print("Using Keras: " + str(K.__version__))
  # 1. load training and test data
  print("\nLoading Ford time series classification data ")
  data_root = "https://raw.githubusercontent.com/"
  data_root += "hfawaz/cd-diagram/master/FordA/"
  x_train, y_train = load_xy(data_root + "FordA_TRAIN.tsv")
  x_test, y_test = load_xy(data_root + "FordA_TEST.tsv")
  # show one class 0 and one class 1 item
  # print("\nExample class 0 (no symptom) and one class 1 item: ")
  # classes = [0,1]
  # plt.figure()
  # for c in classes:
  #   c_x_train = x_train[y_train == c]
  #   plt.plot(c_x_train[0], label="class " + str(c))
  # plt.legend(loc="best")
  # plt.show()
  # plt.close()
  r = x_train.shape[0]
  c = x_train.shape[1]
  x_train = x_train.reshape((r, c, 1))  # (3601, 500, 1)
  r = x_test.shape[0]
  c = x_test.shape[1]
  x_test = x_test.reshape((r, c, 1))    # (1320, 500, 1)
  # 2. create model
  print("\nCreating Transformer model ")
  input_shape = x_train.shape[1:]  # (500,1)
  model = create_model(input_shape, head_size=256, n_heads=4,
    ff_dim=4, n_trans_blocks=4, mlp_units=[128],
    drop=0.25, mlp_drop=0.4)
  model.compile(loss="sparse_categorical_crossentropy",
    optimizer=K.optimizers.Adam(learning_rate=1.0e-4),
    metrics=["sparse_categorical_accuracy"])
  # model.summary()  # prints directly; return value is None
  # 3. train model
  c_backs = [K.callbacks.EarlyStopping(patience=10,
    restore_best_weights=True)]
  print("\nStarting training ")
  model.fit(x_train, y_train, validation_split=0.2,
    epochs=10, batch_size=36, shuffle=True,
    callbacks=c_backs)
  print("Training complete")
  # 4. use model
  print("\nSetting up inpt of 500 random values in (0,1) ")
  inpt = np.random.random((1, 500, 1))  # float64 will convert
  pred = model(inpt)
  print("Prediction: ")
  print(pred)
  print("\nEnd demo ")
# -----------------------------------------------------------
if __name__ == "__main__":
  main()
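The demo's final prediction is a pair of softmax probabilities. Converting such outputs into class labels and an accuracy score is just an argmax and a mean; a numpy sketch using hypothetical probabilities (not actual model output):

```python
import numpy as np

# hypothetical softmax outputs for four items
# (columns: P(class 0), P(class 1)) and hypothetical true labels
probs = np.array([[0.81, 0.19],
                  [0.30, 0.70],
                  [0.55, 0.45],
                  [0.10, 0.90]])
y_true = np.array([0, 1, 1, 1])

y_pred = np.argmax(probs, axis=1)  # predicted class = index of largest prob
acc = np.mean(y_pred == y_true)    # fraction of correct predictions
print(y_pred)  # [0 1 0 1]
print(acc)     # 0.75
```

The same idea applied to the 1320-item test set would give the test accuracy figure mentioned earlier.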