A couple of months ago, I put together a demo that used the LightGBM system to perform anomaly detection. My design used a set of loosely coupled Python functions to simulate an autoencoder. Early one morning before work, I wondered if it would be possible to encapsulate the code into an explicit Autoencoder class. Bottom line: yes, it is possible.
LightGBM (light gradient boosting machine) is a sophisticated tree-based tool for classification and regression. My earlier exploration created an autoencoder system that is a collection of individual functions — sort of a virtual autoencoder.
Briefly, if you have source data with n columns, you create n separate regression models, one for each column. The model[0] predicts the value in column [0] using the other columns. The model[1] predicts the value in column [1] using the other columns. And so on. Together, the n models can predict their input.
Then, you feed all the source data to the autoencoder, and get predicted/reconstructed data. The data item that is predicted most poorly (largest reconstruction error) is an anomaly.
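The "largest reconstruction error" idea can be sketched in a few lines of NumPy. The numbers here are made up purely for illustration; in the real demo the reconstructions come from the trained models:

```python
import numpy as np

# three hypothetical source items
X = np.array([[1.00, 0.24, 0.00, 0.295, 1.00],
              [0.00, 0.39, 1.00, 0.512, 0.50],
              [1.00, 0.63, 0.50, 0.758, 0.00]])
# pretend reconstructions: item 1 is reconstructed poorly
Y = X + np.array([[0.01], [0.30], [0.05]])

errs = np.linalg.norm(X - Y, axis=1)  # one error per item
print(np.argmax(errs))  # 1 -- item 1 is the anomaly
```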
I created a demo using a tiny 10-item subset of one of my standard synthetic datasets. The raw tab-delimited data is:
F  24  michigan  29500.00  liberal
M  39  oklahoma  51200.00  moderate
F  63  nebraska  75800.00  conservative
M  36  michigan  44500.00  moderate
F  27  nebraska  28600.00  liberal
F  50  nebraska  56500.00  moderate
F  50  oklahoma  55000.00  moderate
M  19  oklahoma  32700.00  conservative
F  22  nebraska  27700.00  moderate
M  39  oklahoma  47100.00  liberal
Each line is a person. The fields are sex, age, State (one of three), income, and political leaning (one of three). I encoded sex as M = 0.0, F = 1.0; State as Michigan = 0.0, Nebraska = 0.5, Oklahoma = 1.0; and politics as conservative = 0.0, moderate = 0.5, liberal = 1.0. The idea is to encode so that all values are between 0.0 and 1.0, so that large values don't overwhelm small values during the calculation of reconstruction error. I divided age values by 100 and income values by 100,000. The resulting 10-item comma-delimited encoded and normalized data is:
1.0000, 0.2400, 0.0000, 0.2950, 1.0000
0.0000, 0.3900, 1.0000, 0.5120, 0.5000
1.0000, 0.6300, 0.5000, 0.7580, 0.0000
0.0000, 0.3600, 0.0000, 0.4450, 0.5000
1.0000, 0.2700, 0.5000, 0.2860, 1.0000
1.0000, 0.5000, 0.5000, 0.5650, 0.5000
1.0000, 0.5000, 1.0000, 0.5500, 0.5000
0.0000, 0.1900, 1.0000, 0.3270, 0.0000
1.0000, 0.2200, 0.5000, 0.2770, 0.5000
0.0000, 0.3900, 1.0000, 0.4710, 1.0000
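The encoding and normalization scheme can be sketched like this. The mapping values come from the description above; the function name is my own invention:

```python
def encode_person(line):
  # line is tab-delimited: sex, age, State, income, politics
  sex_map = {"M": 0.0, "F": 1.0}
  state_map = {"michigan": 0.0, "nebraska": 0.5, "oklahoma": 1.0}
  pol_map = {"conservative": 0.0, "moderate": 0.5, "liberal": 1.0}
  sex, age, state, income, politics = line.split("\t")
  return [sex_map[sex], float(age) / 100.0,
    state_map[state], float(income) / 100_000.0,
    pol_map[politics]]

print(encode_person("F\t24\tmichigan\t29500.00\tliberal"))
# [1.0, 0.24, 0.0, 0.295, 1.0]
```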
So, the demo creates five regression models. The first model predicts a sex value in column [0] using the values in columns [1], [2], [3], [4]. The second model predicts an age value in column [1] using the values in columns [0], [2], [3], [4]. And so on.
To create the five models, for each column I randomly selected 80% of the rows for training. The idea is that tree-based models often overfit, so if all of the rows were used for training, the predictions could be perfect and there would be no reconstruction error to analyze.
The statements that create and train the LGBM-based autoencoder are:
print("Creating autoencoder model ")
print("dim = 5 ")
print("n_estimators = 50 ")
print("min_leaf = 2 ")
print("learn_rate = 0.05 ")
ae_model = Autoencoder(5, 50, 2, 0.05)
print("Done ")
print("Training model ")
ae_model.train(data_XY)
print("Done ")
print("Analyzing all data for reconstruction error ")
analyze(ae_model, data_XY)
analyze2(ae_model, data_XY)
print("End demo ")
Behind the scenes, the key lines of code in the train() method are like:
n = len(data_all)  # 10
all_rows = np.arange(n)  # 0, 1, . . 9
selected_rows = np.random.choice(all_rows,
  size=int(n * 0.80), replace=False)
data_partial = data_all[selected_rows,:]  # 8 items
train_x = np.delete(data_partial, target_col, axis=1)
train_y = data_partial[:, target_col]

params = {
  'objective': 'regression',   # not needed
  'n_estimators': 100,         # default = 100
  'learning_rate': 0.05,       # default = 0.10
  'min_data_in_leaf': 2,       # default = 20
  'random_state': 0,
  'verbosity': -1
}
sub_model = L.LGBMRegressor(**params)
sub_model.fit(train_x, train_y)
A big challenge when working with LightGBM is the large number of architecture and training parameters, over 100 of them. There's no easy way to deal with so many parameters other than trial and error, either manual or programmatic.
I set the verbosity parameter to -1 to suppress warning and error messages to keep the demo output tidy. In a non-demo scenario, you want to see all messages.
For reconstruction error, I used Euclidean distance between the source vector and the reconstructed vector. There are other possibilities, but Euclidean distance seems fine to me.
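For reference, here is how a couple of alternative error metrics compare to Euclidean distance on a toy pair of vectors (the numbers are made up):

```python
import numpy as np

x = np.array([1.00, 0.24, 0.00, 0.295, 1.00])  # source item
y = np.array([0.90, 0.30, 0.10, 0.310, 0.95])  # reconstruction

euclid = np.linalg.norm(x - y)     # L2: sqrt of sum of squared diffs
manhattan = np.sum(np.abs(x - y))  # L1: sum of absolute diffs
max_abs = np.max(np.abs(x - y))    # worst single column
print("%0.4f  %0.4f  %0.4f" % (euclid, manhattan, max_abs))
```

On these numbers, the Euclidean error is about 0.1622, the Manhattan error is 0.3250, and the max-absolute error is 0.1000. Any of these would rank anomalies sensibly; they just weight large per-column deviations differently.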
I implemented two functions to analyze the reconstructed data items. The first, analyze(), finds the item with the largest reconstruction error:
Analyzing all data for reconstruction error

Most anomalous idx = [9]
Item:  [ 0.0000  0.3900  1.0000  0.4710  1.0000]
Reconstruction:  [ 1.0000  0.2400  0.0000  0.2950  1.0000]
Error = 0.7887
The second analyze2() function computes all reconstruction errors and sorts them from largest error to smallest. My demo displays the five most anomalous items:
Top 5 most anomalous items and error:
9 :   0.7887
8 :   0.7121
3 :   0.7115
7 :   0.4974
2 :   0.4139
It was an interesting exploration. I’ve noticed that when I create a non-trivial program, I almost always start by designing separate static functions. Then, after the code is up and running, I refactor to a class structure. I almost never start with a class structure — probably just the way my brain works.

The 1990s had some very strange but wonderfully creative animated cartoon shows on TV. Here are three of my favorites.
Left: “Aaahh!!! Real Monsters” (1994) features three young but nice monsters. Oblina (the smart one, like a weird candy cane), Krumm (a hulking somewhat dim-witted monster who holds his eyes in his hands), and Ickis (the cranky one, sort of a demonic rabbit). They live underneath a garbage dump in a city of monsters and attend their monster school. They have all kinds of bizarre but entertaining adventures.
Center: “Rocko’s Modern Life” (1993) features the surreal life of an Australian wallaby named Rocko and his friends including a naive steer named Heffer Wolfe, a neurotic turtle named Filburt, and Rocko’s enthusiastic dog Spunky. Weirdly wildly entertaining.
Right: “CatDog” (1998) tells stories about the life of conjoined brothers of different species. The cat half is cynical; the dog half is always optimistic.
Here is my updated version that encapsulates the autoencoder functionality into an explicit Autoencoder class.
# people_anomaly_lgbm.py
# custom LightGBM autoencoder reconstruction error
# Anaconda3 2023.09-0 Python 3.11.5 LightGBM 4.3.0
import numpy as np
import lightgbm as L
# -----------------------------------------------------------
class Autoencoder():
  def __init__(self, dim, n_estimators, min_leaf, lrn_rate):
    self.dim = dim
    self.n_estimators = n_estimators
    self.min_leaf = min_leaf
    self.lrn_rate = lrn_rate
    self.sub_models = []  # list of LGBMRegressor models

  def train(self, data_all):
    for j in range(self.dim):  # each column
      n = len(data_all)  # use 80% of rows
      all_rows = np.arange(n)
      selected_rows = np.random.choice(all_rows,
        size=int(n * 0.80), replace=False)
      data_partial = data_all[selected_rows,:]
      train_x = np.delete(data_partial, j, axis=1)
      train_y = data_partial[:, j]

      params = {
        'objective': 'regression',            # not needed
        'boosting_type': 'gbdt',              # default
        'n_estimators': self.n_estimators,    # default = 100
        'num_leaves': 31,                     # default
        'learning_rate': self.lrn_rate,       # default = 0.10
        'feature_fraction': 1.0,              # default
        'min_data_in_leaf': self.min_leaf,    # default = 20
        'random_state': 0,
        'verbosity': -1
      }
      sub_model = L.LGBMRegressor(**params)
      sub_model.fit(train_x, train_y)
      self.sub_models.append(sub_model)

  def predict(self, x):
    # x is 1D
    x = x.reshape(1, -1)  # 2D for LGBMRegressor.predict()
    result = np.zeros(self.dim, dtype=np.float64)
    for i in range(self.dim):
      xx = np.delete(x, i, axis=1)  # peel away target col
      pred = self.sub_models[i].predict(xx)  # 1D array, length 1
      result[i] = pred[0]
    return result
# -----------------------------------------------------------
# -----------------------------------------------------------
def analyze(model, data_XY):
  n = len(data_XY)
  most_anom_idx = 0
  most_anom_item = data_XY[0]
  most_anom_recon = data_XY[0]
  largest_err = 0.0
  for i in range(n):
    x = data_XY[i]
    y = model.predict(x)
    err = np.linalg.norm(x-y)
    if err > largest_err:
      largest_err = err
      most_anom_idx = i
      most_anom_item = x
      most_anom_recon = y

  print("\nMost anomalous idx = [" + str(most_anom_idx) + "]")
  print("Item: ", end="")
  print(most_anom_item)
  print("Reconstruction: ", end="")
  print(most_anom_recon)
  print("Error = %0.4f " % largest_err)
# -----------------------------------------------------------
def analyze2(model, data_XY):
  n = len(data_XY)
  ids = np.arange(n, dtype=np.int64)  # 0, 1, 2, . .
  errors = np.zeros(n, dtype=np.float64)
  for i in range(n):
    x = data_XY[i]
    y = model.predict(x)
    err = np.linalg.norm(x-y)
    errors[i] = err
  sorted_error_idxs = np.flip(np.argsort(errors))
  sorted_errors = errors[sorted_error_idxs]
  sorted_ids = ids[sorted_error_idxs]
  print("\nTop 5 most anomalous items and error: ")
  for i in range(5):
    print(str(sorted_ids[i]) + " : ", end="")
    print("%8.4f" % sorted_errors[i])
# -----------------------------------------------------------
def main():
  print("\nAnomaly detection using LightGBM autoencoder ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True,
    floatmode='fixed', sign=" ")

  print("\nLoading source data ")
  src = ".\\Data\\people_10.txt"  # tiny subset
  # 1.0000, 0.2400, 0.0000, 0.2950, 1.0000
  # 0.0000, 0.3900, 1.0000, 0.5120, 0.5000
  # . . .
  # sex  age  State  income  politics
  data_XY = np.loadtxt(src, usecols=[0,1,2,3,4],
    delimiter=',', comments="#", dtype=np.float64)
  print("\nFirst 3 rows source data: ")
  for i in range(3):
    print(data_XY[i])

  print("\nCreating autoencoder model ")
  print("dim = 5 ")
  print("n_estimators = 50 ")
  print("min_leaf = 2 ")
  print("learn_rate = 0.05 ")
  ae_model = Autoencoder(5, 50, 2, 0.05)
  print("Done ")

  print("\nTraining model ")
  ae_model.train(data_XY)
  print("Done ")

  print("\nFirst 3 predicted data items: ")
  for i in range(3):
    x = data_XY[i]  # x is 1D
    y = ae_model.predict(x)
    print(y)

  print("\nAnalyzing all data for reconstruction error ")
  analyze(ae_model, data_XY)
  analyze2(ae_model, data_XY)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
