I wrote an article titled “Binary Classification Using LightGBM” in the July 2024 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/Articles/2024/07/01/lightgbm-classification.aspx.
LightGBM (lightweight gradient boosted machine) is a sophisticated, open-source, tree-based system that was introduced in 2017. LightGBM can perform binary classification, multi-class classification (predict one of three or more possible values), regression (predict a single numeric value), and ranking.
The article presents a complete end-to-end demo program. LightGBM has three programming language interfaces: C, Python, and R. The demo program uses the Python language API.
I used one of my standard synthetic datasets. The raw data looks like:
F  24  michigan  29500.00  liberal
M  39  oklahoma  51200.00  moderate
F  63  nebraska  75800.00  conservative
M  36  michigan  44500.00  moderate
F  27  nebraska  28600.00  liberal
. . .
The goal is to predict a person's sex (M = 0, F = 1) from age, state of residence, annual income, and political leaning. Because LightGBM is a tree-based system, you must encode the categorical data using zero-based ordinal encoding, but numeric data can be used as-is without normalization. The encoded data looks like:
1, 24, 0, 29500.00, 2
0, 39, 2, 51200.00, 1
1, 63, 1, 75800.00, 0
0, 36, 0, 44500.00, 1
1, 27, 1, 28600.00, 2
. . .
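Loading this kind of comma-delimited encoded data is straightforward with NumPy. A minimal self-contained sketch (the data is embedded as a string here rather than read from a file, and the column layout assumed is the one shown above: sex first, then the four predictors):

```python
import io
import numpy as np

# the encoded data shown above, embedded for a self-contained sketch
encoded = """\
1, 24, 0, 29500.00, 2
0, 39, 2, 51200.00, 1
1, 63, 1, 75800.00, 0
0, 36, 0, 44500.00, 1
1, 27, 1, 28600.00, 2
"""

data = np.loadtxt(io.StringIO(encoded), delimiter=",", dtype=np.float64)
train_x = data[:, 1:5]                 # age, state, income, politics
train_y = data[:, 0].astype(np.int64)  # sex: 0 = M, 1 = F
```

In practice the string argument would be replaced by a path to the training data file.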
After the data is loaded into memory, the key statements are:
print("Creating LGBM binary classification model ")
params = {
'objective': 'binary', # not needed for this API
'boosting_type': 'gbdt', # default
'num_leaves': 31, # default
'learning_rate': 0.05, # default = 0.10
'n_estimators': 100, # default
'feature_fraction': 1.0, # default
'min_data_in_leaf': 10, # default = 20
'random_state': 0,
'verbosity': -1 # default = 1
}
model = lgbm.LGBMClassifier(**params)
model.fit(train_x, train_y)
The trained model predicts the training data with 97.00 percent accuracy (194 out of 200 correct) and predicts the test data with 72.5 percent accuracy (29 out of 40 correct).
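In the scikit-learn-style API used here, model.predict() returns predicted class labels, so accuracy is just the fraction of predictions that match the actual labels. A minimal sketch of that computation, with dummy arrays standing in for the real predictions and labels (194 of 200 correct, matching the training figure above):

```python
import numpy as np

def accuracy(preds, actuals):
  # fraction of predicted labels that match the actual labels
  return float(np.mean(preds == actuals))

# dummy stand-ins: 194 correct predictions out of 200
preds = np.array([0]*194 + [1]*6, dtype=np.int64)
actuals = np.zeros(200, dtype=np.int64)
acc = accuracy(preds, actuals)
print(acc)  # 0.97
```

With a trained model, preds would come from model.predict(train_x).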
The main challenge when using LightGBM is wading through the dozens of parameters. The LGBMClassifier class/object has 19 parameters (num_leaves, max_depth, etc.) and behind the scenes there are an additional 57 Learning Control Parameters (min_data_in_leaf, bagging_fraction, etc.), for a total of 76 parameters to deal with.
Because that many parameters are not manageable, you must rely mostly on the default values and then experiment to find the handful of parameters that produce a good model. Based on my experience, the three most important parameters to explore and modify are n_estimators, min_data_in_leaf and learning_rate. The article gives some of the rules of thumb I use for choosing parameter values.
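Exploring those three parameters can be sketched as a simple grid loop. Everything in this sketch is illustrative: the candidate values are arbitrary, and evaluate() is a hypothetical placeholder for fitting an LGBMClassifier with the candidate values and scoring it on held-out data.

```python
import itertools

# arbitrary candidate values for the three key parameters
n_estimators_vals = [50, 100, 200]
min_data_vals = [5, 10, 20]
lr_vals = [0.01, 0.05, 0.10]

def evaluate(n_est, min_data, lr):
  # hypothetical placeholder: in practice, fit an LGBMClassifier with
  # these values and return its accuracy on held-out test data
  return -abs(n_est - 100) - abs(min_data - 10) - abs(lr - 0.05)

# pick the combination with the best score
best = max(itertools.product(n_estimators_vals, min_data_vals, lr_vals),
  key=lambda t: evaluate(*t))
print(best)
```

For a small grid like this, an exhaustive loop is usually cheaper and simpler than a formal hyperparameter search library.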

As AI-generated images get better and better, it's becoming increasingly difficult to classify them as real or artificial. Here are three nice AI-generated alien insects. I like generated images of alien life to resemble real life, but to be just different enough to seem alien in some way.

