Deep neural systems based on Transformer Architecture (TA, also called multi-headed attention models) have revolutionized natural language processing (NLP). TA systems were designed to deal with sequence-to-sequence problems, such as translating English text to German text. TA systems can also handle sequence-to-value problems, such as sentiment analysis.
I came across an interesting example in the Keras library documentation that used Transformer Architecture to perform time series classification. This is a sequence-to-value problem where the sequence data is numeric rather than word-tokens in a sentence.
Specifically, the example program created a binary classifier for the Ford time series data. The Ford A dataset has 3601 training items and 1320 test items. Each data item has 500 time series values between about -5.0 and +5.0 that represent a measurement of engine noise. Each of the 500 measurement values was captured at evenly spaced intervals (perhaps every 10 milliseconds). Each time series item is classified as -1 (no engine symptom) or +1 (engine symptom).
Note: I tracked down the source research paper for the Ford time series data but I don’t remember the details. The important idea is that there is numeric time series data and each series has a class label to predict. This is not at all the same as a time series regression problem where each time series is unlabeled and the goal is to predict the next numeric value in the series.
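Each line of the raw Ford A .tsv files holds the class label in column 0 followed by the 500 tab-separated values. A minimal sketch of the loading-and-relabeling idea, using a tiny hypothetical two-item snippet with just three values per item instead of 500:

```python
import io
import numpy as np

# hypothetical two-item snippet in the Ford A tab-separated format:
# label in column 0 (-1 or +1), then the time series values
raw = "-1\t0.79\t-1.20\t0.33\n+1\t-0.45\t2.10\t-0.88\n"
xy = np.loadtxt(io.StringIO(raw), delimiter="\t")
y = xy[:, 0]          # labels in column 0: [-1.  1.]
y[y == -1] = 0        # convert (-1,+1) to (0,1) for Keras
x = xy[:, 1:]         # remaining columns are the features
print(y.astype(int))  # [0 1]
print(x.shape)        # (2, 3)
```

The real files are fetched by URL in the demo; the snippet above only illustrates the format and the label conversion.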

Graphs of the first class -1 item and the first class +1 item in the training data. The demo program converts -1 labels to 0 labels.
Whenever I find an interesting code example that I want to explore, my first step is to refactor the example. This forces me to examine every line of code. So that’s what I did.

I only ran the demo for 10 epochs, which took about 6,000 seconds = 100 minutes = an hour and 40 minutes. Running longer would improve the accuracy on the test data to about 95%.

The model summary. TA systems are not simple.
As expected, the demo code is extremely complicated. Relative to the number of lines of code, Transformer Architecture systems are by far the most complex software systems I work with.
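The core computation inside each encoder block's MultiHeadAttention layer is scaled dot-product self-attention. A minimal single-head numpy sketch (with hypothetical random projection weights, not the actual Keras layer) shows the essential idea:

```python
import numpy as np

def softmax(z):
  e = np.exp(z - np.max(z, axis=-1, keepdims=True))
  return e / np.sum(e, axis=-1, keepdims=True)

# scaled dot-product self-attention for one head (a sketch, not the
# full Keras MultiHeadAttention layer): seq_len=4, key_dim=3
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 3))   # (seq_len, key_dim) input
Wq = rng.standard_normal((3, 3))  # hypothetical learned projections
Wk = rng.standard_normal((3, 3))
Wv = rng.standard_normal((3, 3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = (q @ k.T) / np.sqrt(3.0)  # scaled dot products
weights = softmax(scores)          # each row sums to 1
out = weights @ v                  # (4, 3) attended output
print(out.shape)                   # (4, 3)
```

The Keras layer adds multiple heads, output projection, masking and dropout on top of this, which is part of why the full systems are so complex.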
I fiddled with the demo TA example for several hours. How, or even “if”, this exploration will eventually pay off is not clear. But that’s what doing research is all about. If I get some time, my next step will be to refactor the Keras code to PyTorch.

Three covers for “A Princess of Mars”, by Edgar Rice Burroughs. The book is one of the most influential in the history of science fiction. The story first appeared in serialized form in “All-Story Magazine” in 1912, and was compiled into book form in 1917. Over the years, different artists have produced cover illustrations in similar but clearly distinct styles. Left: By artist Frank Schoonover (1917). Center: By Robert Abbett (1963). Right: By Gino D’Achille (1973).
Code below. Long.
# ford_tsc_transformer.py
# time series classification using a Keras
# Transformer (Multiheaded Attention)
# from https://keras.io/examples/timeseries/
# timeseries_transformer_classification/
# Keras 2.6.0 in TensorFlow 2.6.0
# Anaconda3-2020.02 Python 3.7.6 Windows 10
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2' # suppress CPU warn
import numpy as np
import tensorflow as tf
from tensorflow import keras as K
import matplotlib.pyplot as plt
# -----------------------------------------------------------
# http://www.j-wichard.de/publications/FordPaper.pdf
# The Ford data has 3601 training items and 1320 test items.
# Each item has 500 input values, roughly between -5.0
# and +5.0. Each item is class -1 (no symptom) or +1.
def load_xy(fn):
  xy = np.loadtxt(fn, delimiter="\t")
  y = xy[:, 0]    # labels in column 0 are (-1,+1)
  y[y == -1] = 0  # convert labels from (-1,+1) to (0,1)
  x = xy[:, 1:]   # 500 features
  return x, y.astype(int)
# -----------------------------------------------------------
def trans_encoder(inputs, head_size, n_heads, ff_dim,
    drop=0.0):
  # MultiHeadAttention with pre-layer normalization
  x = K.layers.LayerNormalization(epsilon=1e-6)(inputs)
  x = K.layers.MultiHeadAttention(
    key_dim=head_size, num_heads=n_heads, dropout=drop)(x, x)
  x = K.layers.Dropout(drop)(x)
  res = x + inputs  # residual connection
  # feed-forward part
  x = K.layers.LayerNormalization(epsilon=1e-6)(res)
  x = K.layers.Conv1D(filters=ff_dim, kernel_size=1,
    activation="relu")(x)
  x = K.layers.Dropout(drop)(x)
  x = K.layers.Conv1D(filters=inputs.shape[-1],
    kernel_size=1)(x)
  return x + res  # second residual connection
# -----------------------------------------------------------
def create_model(input_shape, head_size, n_heads, ff_dim,
    n_trans_blocks, mlp_units, drop=0.0, mlp_drop=0.0):
  inpts = K.Input(shape=input_shape)
  x = inpts
  for _ in range(n_trans_blocks):
    x = trans_encoder(x, head_size, n_heads, ff_dim, drop)
  x = K.layers.\
    GlobalAveragePooling1D(data_format="channels_first")(x)
  for dim in mlp_units:
    x = K.layers.Dense(dim, activation="relu")(x)
    x = K.layers.Dropout(mlp_drop)(x)
  oupts = K.layers.Dense(2, activation="softmax")(x)
  return K.Model(inpts, oupts)
# -----------------------------------------------------------
def main():
  # 0. get ready
  print("\nBegin Transformer classification demo ")
  np.random.seed(1)
  tf.random.set_seed(1)
  print("Using Keras: " + str(K.__version__))
  # 1. load training and test data
  print("\nLoading Ford time series classification data ")
  data_root = "https://raw.githubusercontent.com/"
  data_root += "hfawaz/cd-diagram/master/FordA/"
  x_train, y_train = load_xy(data_root + "FordA_TRAIN.tsv")
  x_test, y_test = load_xy(data_root + "FordA_TEST.tsv")
  # show one class 0 and one class 1 item
  # print("\nExample class 0 (no symptom) and one class 1 item: ")
  # classes = [0,1]
  # plt.figure()
  # for c in classes:
  #   c_x_train = x_train[y_train == c]
  #   plt.plot(c_x_train[0], label="class " + str(c))
  # plt.legend(loc="best")
  # plt.show()
  # plt.close()
  r = x_train.shape[0]
  c = x_train.shape[1]
  x_train = x_train.reshape((r, c, 1))  # (3601, 500, 1)
  r = x_test.shape[0]
  c = x_test.shape[1]
  x_test = x_test.reshape((r, c, 1))    # (1320, 500, 1)
  # 2. create model
  print("\nCreating Transformer model ")
  input_shape = x_train.shape[1:]  # (500,1)
  model = create_model(input_shape, head_size=256, n_heads=4,
    ff_dim=4, n_trans_blocks=4, mlp_units=[128],
    drop=0.25, mlp_drop=0.4)
  model.compile(loss="sparse_categorical_crossentropy",
    optimizer=K.optimizers.Adam(learning_rate=1.0e-4),
    metrics=["sparse_categorical_accuracy"])
  # model.summary()  # prints directly; return value is None
  # 3. train model
  c_backs = [K.callbacks.EarlyStopping(patience=10,
    restore_best_weights=True)]
  print("\nStarting training ")
  model.fit(x_train, y_train, validation_split=0.2,
    epochs=10, batch_size=36, shuffle=True,
    callbacks=c_backs)
  print("Training complete")
  # 4. use model
  print("\nSetting up inpt of 500 random values in (0,1) ")
  inpt = np.random.random((1, 500, 1))  # float64 will convert
  pred = model(inpt)
  print("Prediction: ")
  print(pred)
  print("\nEnd demo ")
# -----------------------------------------------------------
if __name__ == "__main__":
  main()
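The demo's final prediction is a pair of softmax probabilities. Converting such outputs into class labels and an accuracy score is just an argmax and a mean; a numpy sketch using hypothetical probabilities (not actual model output):

```python
import numpy as np

# hypothetical softmax outputs for four items
# (columns: P(class 0), P(class 1)) and hypothetical true labels
probs = np.array([[0.81, 0.19],
                  [0.30, 0.70],
                  [0.55, 0.45],
                  [0.10, 0.90]])
y_true = np.array([0, 1, 1, 1])

y_pred = np.argmax(probs, axis=1)  # predicted class = index of largest prob
acc = np.mean(y_pred == y_true)    # fraction of correct predictions
print(y_pred)  # [0 1 0 1]
print(acc)     # 0.75
```

The same idea applied to the 1320-item test set would give the test accuracy figure mentioned earlier.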