A couple of months ago, I put together a demo that used the LightGBM system to perform anomaly detection. My design used a set of loosely coupled Python functions to simulate an autoencoder. Early one morning before work, I wondered if it would be possible to encapsulate the code into an explicit Autoencoder class. Bottom line: yes, it is possible.
LightGBM (light gradient boosting machine) is a sophisticated tree-based tool for classification and regression. My earlier exploration created an autoencoder system that is a collection of individual functions — sort of a virtual autoencoder.
Briefly, if you have source data with n columns, you create n separate regression models, one for each column. The model[0] predicts the value in column [0] using the other columns. The model[1] predicts the value in column [1] using the other columns. And so on. Together, the n models can predict their input.
Then, you feed all the source data to the autoencoder, and get predicted/reconstructed data. The data item that is predicted most poorly (largest reconstruction error) is an anomaly.
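The "largest reconstruction error" idea can be sketched in a few lines of NumPy. The numbers here are made up purely for illustration; in the real demo the reconstructions come from the trained models:

```python
import numpy as np

# three hypothetical source items
X = np.array([[1.00, 0.24, 0.00, 0.295, 1.00],
              [0.00, 0.39, 1.00, 0.512, 0.50],
              [1.00, 0.63, 0.50, 0.758, 0.00]])
# pretend reconstructions: item 1 is reconstructed poorly
Y = X + np.array([[0.01], [0.30], [0.05]])

errs = np.linalg.norm(X - Y, axis=1)  # one error per item
print(np.argmax(errs))  # 1 -- item 1 is the anomaly
```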
I created a demo using a tiny 10-item subset of one of my standard synthetic datasets. The raw tab-delimited data is:
F  24  michigan  29500.00  liberal
M  39  oklahoma  51200.00  moderate
F  63  nebraska  75800.00  conservative
M  36  michigan  44500.00  moderate
F  27  nebraska  28600.00  liberal
F  50  nebraska  56500.00  moderate
F  50  oklahoma  55000.00  moderate
M  19  oklahoma  32700.00  conservative
F  22  nebraska  27700.00  moderate
M  39  oklahoma  47100.00  liberal
Each line is a person. The fields are sex, age, State (one of three), income, and political leaning (one of three). I encoded sex as M = 0.0, F = 1.0; State as Michigan = 0.0, Nebraska = 0.5, Oklahoma = 1.0; and politics as conservative = 0.0, moderate = 0.5, liberal = 1.0. The idea is to encode so that all values are between 0.0 and 1.0, so that large values don't overwhelm small values during the calculation of reconstruction error. I divided age values by 100 and income values by 100,000. The resulting 10-item comma-delimited encoded and normalized data is:
1.0000, 0.2400, 0.0000, 0.2950, 1.0000
0.0000, 0.3900, 1.0000, 0.5120, 0.5000
1.0000, 0.6300, 0.5000, 0.7580, 0.0000
0.0000, 0.3600, 0.0000, 0.4450, 0.5000
1.0000, 0.2700, 0.5000, 0.2860, 1.0000
1.0000, 0.5000, 0.5000, 0.5650, 0.5000
1.0000, 0.5000, 1.0000, 0.5500, 0.5000
0.0000, 0.1900, 1.0000, 0.3270, 0.0000
1.0000, 0.2200, 0.5000, 0.2770, 0.5000
0.0000, 0.3900, 1.0000, 0.4710, 1.0000
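The encoding and normalization scheme can be sketched like this. The mapping values come from the description above; the function name is my own invention:

```python
def encode_person(line):
  # line is tab-delimited: sex, age, State, income, politics
  sex_map = {"M": 0.0, "F": 1.0}
  state_map = {"michigan": 0.0, "nebraska": 0.5, "oklahoma": 1.0}
  pol_map = {"conservative": 0.0, "moderate": 0.5, "liberal": 1.0}
  sex, age, state, income, politics = line.split("\t")
  return [sex_map[sex], float(age) / 100.0,
    state_map[state], float(income) / 100_000.0,
    pol_map[politics]]

print(encode_person("F\t24\tmichigan\t29500.00\tliberal"))
# [1.0, 0.24, 0.0, 0.295, 1.0]
```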
So, the demo creates five regression models. The first model predicts a sex value in column [0] using the values in columns [1], [2], [3], [4]. The second model predicts an age value in column [1] using the values in columns [0], [2], [3], [4]. And so on.
To create the five models, for each column I randomly selected 80% of the rows for training. The idea is that tree-based models often overfit, so if all of the rows were used for training, the predictions could be perfect and there would be no reconstruction error to analyze.
The statements that create and train the LGBM-based autoencoder are:
print("Creating autoencoder model ")
print("dim = 5 ")
print("n_estimators = 50 ")
print("min_leaf = 2 ")
print("learn_rate = 0.05 ")
ae_model = Autoencoder(5, 50, 2, 0.05)
print("Done ")
print("Training model ")
ae_model.train(data_XY)
print("Done ")
print("Analyzing all data for reconstruction error ")
analyze(ae_model, data_XY)
analyze2(ae_model, data_XY)
print("End demo ")
Behind the scenes, the key lines of code in the train() method are like:
n = len(data_all)  # 10
all_rows = np.arange(n)  # 0, 1, . . 9
selected_rows = np.random.choice(all_rows,
  size=int(n * 0.80), replace=False)
data_partial = data_all[selected_rows,:]  # 8 items
train_x = np.delete(data_partial, target_col, axis=1)
train_y = data_partial[:, target_col]

params = {
  'objective': 'regression',   # not needed
  'n_estimators': 100,         # default = 100
  'learning_rate': 0.05,       # default = 0.10
  'min_data_in_leaf': 2,       # default = 20
  'random_state': 0,
  'verbosity': -1
}
sub_model = L.LGBMRegressor(**params)
sub_model.fit(train_x, train_y)
A big challenge when working with LightGBM is the large number of architecture and training parameters, over 100 of them. There's no easy way to deal with so many parameters other than trial and error, either manual or programmatic.
I set the verbosity parameter to -1 to suppress warning and error messages to keep the demo output tidy. In a non-demo scenario, you want to see all messages.
For reconstruction error, I used Euclidean distance between the source vector and the reconstructed vector. There are other possibilities, but Euclidean distance seems fine to me.
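For reference, here is how a couple of alternative error metrics compare to Euclidean distance on a toy pair of vectors (the numbers are made up):

```python
import numpy as np

x = np.array([1.00, 0.24, 0.00, 0.295, 1.00])  # source item
y = np.array([0.90, 0.30, 0.10, 0.310, 0.95])  # reconstruction

euclid = np.linalg.norm(x - y)     # L2: sqrt of sum of squared diffs
manhattan = np.sum(np.abs(x - y))  # L1: sum of absolute diffs
max_abs = np.max(np.abs(x - y))    # worst single column
print("%0.4f  %0.4f  %0.4f" % (euclid, manhattan, max_abs))
```

On these numbers, the Euclidean error is about 0.1622, the Manhattan error is 0.3250, and the max-absolute error is 0.1000. Any of these would rank anomalies sensibly; they just weight large per-column deviations differently.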
I implemented two functions to analyze the reconstructed data items. The first, analyze(), finds the item with the largest reconstruction error:
Analyzing all data for reconstruction error

Most anomalous idx = [9]
Item:  [ 0.0000  0.3900  1.0000  0.4710  1.0000]
Reconstruction:  [ 1.0000  0.2400  0.0000  0.2950  1.0000]
Error = 0.7887
The second analyze2() function computes all reconstruction errors and sorts them from largest error to smallest. My demo displays the five most anomalous items:
Top 5 most anomalous items and error:
9 :   0.7887
8 :   0.7121
3 :   0.7115
7 :   0.4974
2 :   0.4139
It was an interesting exploration. I’ve noticed that when I create a non-trivial program, I almost always start by designing separate static functions. Then, after the code is up and running, I refactor to a class structure. I almost never start with a class structure — probably just the way my brain works.

The 1990s had some very strange but wonderfully creative animated cartoon shows on TV. Here are three of my favorites.
Left: “Aaahh!!! Real Monsters” (1994) features three young but nice monsters. Oblina (the smart one, like a weird candy cane), Krumm (a hulking somewhat dim-witted monster who holds his eyes in his hands), and Ickis (the cranky one, sort of a demonic rabbit). They live underneath a garbage dump in a city of monsters and attend their monster school. They have all kinds of bizarre but entertaining adventures.
Center: “Rocko’s Modern Life” (1993) features the surreal life of an Australian wallaby named Rocko and his friends including a naive steer named Heffer Wolfe, a neurotic turtle named Filburt, and Rocko’s enthusiastic dog Spunky. Weirdly wildly entertaining.
Right: “CatDog” (1998) tells stories about the life of conjoined brothers of different species. The cat half is cynical; the dog half is always optimistic.
Here is my updated version that encapsulates the autoencoder functionality into an explicit Autoencoder class.
# people_anomaly_lgbm.py
# custom LightGBM autoencoder reconstruction error
# Anaconda3 2023.09-0 Python 3.11.5 LightGBM 4.3.0
import numpy as np
import lightgbm as L
# -----------------------------------------------------------
class Autoencoder():
  def __init__(self, dim, n_estimators, min_leaf, lrn_rate):
    self.dim = dim
    self.n_estimators = n_estimators
    self.min_leaf = min_leaf
    self.lrn_rate = lrn_rate
    self.sub_models = []  # list of LGBMRegressor models

  def train(self, data_all):
    for j in range(self.dim):  # each column
      n = len(data_all)  # use 80% of rows
      all_rows = np.arange(n)
      selected_rows = np.random.choice(all_rows,
        size=int(n * 0.80), replace=False)
      data_partial = data_all[selected_rows,:]
      train_x = np.delete(data_partial, j, axis=1)
      train_y = data_partial[:, j]

      params = {
        'objective': 'regression',            # not needed
        'boosting_type': 'gbdt',              # default
        'n_estimators': self.n_estimators,    # default = 100
        'num_leaves': 31,                     # default
        'learning_rate': self.lrn_rate,       # default = 0.10
        'feature_fraction': 1.0,              # default
        'min_data_in_leaf': self.min_leaf,    # default = 20
        'random_state': 0,
        'verbosity': -1
      }
      sub_model = L.LGBMRegressor(**params)
      sub_model.fit(train_x, train_y)
      self.sub_models.append(sub_model)

  def predict(self, x):
    # x is 1D
    x = x.reshape(1, -1)  # 2D for LGBMRegressor.predict()
    result = np.zeros(self.dim, dtype=np.float64)
    for i in range(self.dim):
      xx = np.delete(x, i, axis=1)  # peel away target col
      pred = self.sub_models[i].predict(xx)  # 1D array, length 1
      result[i] = pred[0]
    return result
# -----------------------------------------------------------
# -----------------------------------------------------------
def analyze(model, data_XY):
  n = len(data_XY)
  most_anom_idx = 0
  most_anom_item = data_XY[0]
  most_anom_recon = data_XY[0]
  largest_err = 0.0
  for i in range(n):
    x = data_XY[i]
    y = model.predict(x)
    err = np.linalg.norm(x-y)
    if err > largest_err:
      largest_err = err
      most_anom_idx = i
      most_anom_item = x
      most_anom_recon = y

  print("\nMost anomalous idx = [" + str(most_anom_idx) + "]")
  print("Item: ", end="")
  print(most_anom_item)
  print("Reconstruction: ", end="")
  print(most_anom_recon)
  print("Error = %0.4f " % largest_err)
# -----------------------------------------------------------
def analyze2(model, data_XY):
  n = len(data_XY)
  ids = np.arange(n, dtype=np.int64)  # 0, 1, 2, . .
  errors = np.zeros(n, dtype=np.float64)
  for i in range(n):
    x = data_XY[i]
    y = model.predict(x)
    err = np.linalg.norm(x-y)
    errors[i] = err
  sorted_error_idxs = np.flip(np.argsort(errors))
  sorted_errors = errors[sorted_error_idxs]
  sorted_ids = ids[sorted_error_idxs]
  print("\nTop 5 most anomalous items and error: ")
  for i in range(5):
    print(str(sorted_ids[i]) + " : ", end="")
    print("%8.4f" % sorted_errors[i])
# -----------------------------------------------------------
def main():
  print("\nAnomaly detection using LightGBM autoencoder ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True,
    floatmode='fixed', sign=" ")

  print("\nLoading source data ")
  src = ".\\Data\\people_10.txt"  # tiny subset
  # 1.0000, 0.2400, 0.0000, 0.2950, 1.0000
  # 0.0000, 0.3900, 1.0000, 0.5120, 0.5000
  # . . .
  # sex  age  State  income  politics
  data_XY = np.loadtxt(src, usecols=[0,1,2,3,4],
    delimiter=',', comments="#", dtype=np.float64)
  print("\nFirst 3 rows source data: ")
  for i in range(3):
    print(data_XY[i])

  print("\nCreating autoencoder model ")
  print("dim = 5 ")
  print("n_estimators = 50 ")
  print("min_leaf = 2 ")
  print("learn_rate = 0.05 ")
  ae_model = Autoencoder(5, 50, 2, 0.05)
  print("Done ")

  print("\nTraining model ")
  ae_model.train(data_XY)
  print("Done ")

  print("\nFirst 3 predicted data items: ")
  for i in range(3):
    x = data_XY[i]  # x is 1D
    y = ae_model.predict(x)
    print(y)

  print("\nAnalyzing all data for reconstruction error ")
  analyze(ae_model, data_XY)
  analyze2(ae_model, data_XY)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
