Logistic Regression Using The scikit Library

I was exploring positive and unlabeled learning (PUL). PUL has two phases. In the first phase, you iteratively create and train a binary classification model many times, so it makes sense to use a classification model that is simple and quick to train, and logistic regression meets those criteria. I had previously looked at creating a logistic regression model using PyTorch, and the results were OK. But I figured I'd take a look at using the scikit library too. The bottom line: the scikit library is convenient, but it doesn't have the flexibility I need for most of my problem scenarios.

I was surprised at how fast the scikit LogisticRegression model trains. The scikit library uses L-BFGS optimization by default. A neural network approach to logistic regression typically uses some form of stochastic gradient descent, which relies only on first-order derivatives (the gradient). L-BFGS is a quasi-Newton method that approximates second-order derivative information (curvature), which allows it to converge in very few iterations. The downside to L-BFGS is that it requires all training data to be in memory, so it isn't well suited for very large training datasets.

I used the Banknote Authentication dataset. It has 1,372 data items. Each item represents a digital image of a banknote (think euro or dollar bill). There are four predictor values followed by a class label: 0 (authentic) or 1 (forgery).

I fetched the raw data from archive.ics.uci.edu/ml/datasets/banknote+authentication. I added ID numbers from 1 to 1372 (not necessary; just to track items). Then I randomly split the data into a 1,097-item set for training and a 275-item set for testing. I divided all four predictor values by 20 to normalize them so that they'd be between -1.0 and +1.0.
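The divide-by-20 normalization can be sketched as follows. The two raw rows below are illustrative values in the format described (ID, four predictors, class label), not actual lines from the file.

```python
# sketch of the k=20 normalization: divide the four predictor
# columns by 20 so values land roughly in [-1.0, +1.0]
# the row values here are made up for illustration
import numpy as np

raw = np.array([
  [1, 3.6216, 8.6661, -2.8073, -0.4469, 0],  # ID, 4 preds, class
  [2, 4.5459, 8.1674, -2.4586, -1.4621, 0],
], dtype=np.float32)

norm = raw.copy()
norm[:, 1:5] = raw[:, 1:5] / 20.0  # only the predictor columns

print(norm)
```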

In my demo, I loaded data using the Pandas library’s read_csv() function, which seems to be the norm for scikit users. I could have used the NumPy loadtxt() function instead.
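A minimal sketch of the loadtxt() alternative, using a tiny in-memory stand-in for the tab-delimited file so the example is self-contained (the demo reads an actual file from disk, and the data values here are made up):

```python
# sketch: loading tab-delimited data with NumPy instead of Pandas
import io
import numpy as np

# in-memory stand-in for the training file: ID, 4 predictors, class
fake_file = io.StringIO(
  "1\t0.1811\t0.4333\t-0.1404\t-0.0223\t0\n"
  "2\t0.2273\t0.4084\t-0.1229\t-0.0731\t0\n")

all_xy = np.loadtxt(fake_file, usecols=[1,2,3,4,5],
  delimiter="\t", dtype=np.float32)
train_x = all_xy[:, 0:4]                  # four predictors
train_y = all_xy[:, 4].astype(np.int64)   # class labels, 1-D

print(train_x.shape, train_y.shape)  # (2, 4) (2,)
```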

Logistic regression is the epitome of classical machine learning, and it has a certain mathematical beauty. The technique is simple, but there are many different ways to implement it programmatically. The scikit library is good for standard tasks, but it isn't a good choice if you need a custom model of some sort.


Cleopatra (69 BC – 30 BC) is the personification of classical beauty. Many people are surprised to learn that Cleopatra wasn't Egyptian; she was a member of the Greek Ptolemaic dynasty, which took control of Egypt after Alexander the Great conquered it in 332 BC. Cleopatra was a descendant of Ptolemy I, one of Alexander's generals. There are many possible interpretations of what she looked like. These three images are likely close. The center image was recovered from the ruins of the city of Herculaneum, which was destroyed by the eruption of Mount Vesuvius in 79 AD (along with the nearby city of Pompeii).


# banknote_scikit.py

import pandas as pd
import numpy as np
import pickle

from sklearn import metrics 
from sklearn.linear_model import LogisticRegression

# ----------------------------------------------------------

# archive.ics.uci.edu/ml/datasets/banknote+authentication
# IDs 0001 to 1372 added as first column
# data has been k=20 normalized (all four predictor columns)
# ID  variance  skewness  kurtosis  entropy  class
# [0]    [1]      [2]       [3]       [4]     [5]
#  (0 = authentic, 1 = forgery) 
# train: 1097 items, test: 275 items

# ----------------------------------------------------------

def main():
  # 0. get started
  print("\nBanknote using scikit logistic regression \n")
  np.random.seed(1)

  # 1. load data
  print("\nLoading Banknote data into dataframes ")

  train_file = ".\\Data\\banknote_k20_train.txt"
  test_file = ".\\Data\\banknote_k20_test.txt"

  train_df = pd.read_csv(train_file, 
    sep="\t", usecols = [1,2,3,4,5], header=None)
  train_x = train_df.iloc[:, [0,1,2,3]]
  train_y = train_df.iloc[:, 4]  # 1-D required

  test_df = pd.read_csv(test_file, 
    sep="\t", usecols = [1,2,3,4,5], header=None)
  test_x = test_df.iloc[:, [0,1,2,3]]
  test_y = test_df.iloc[:, 4]  # 1-D required

  # 2. create model
  print("\nCreating logistic regression model ")
  model = LogisticRegression()

  # 3. train model
  print("\nStarting training ")
  model.fit(train_x, train_y)
  print("Done ")

  # 4. evaluate model
  train_preds = model.predict(train_x)
  acc_train = metrics.accuracy_score(train_y, train_preds)
  print("\nAccuracy on train data = %0.2f%%" % \
    (acc_train * 100))

  test_preds = model.predict(test_x)
  acc_test = metrics.accuracy_score(test_y, test_preds)
  print("Accuracy on test data = %0.2f%%" % \
    (acc_test * 100))

  # 5. save model
  print("\nSaving trained logistic regression model \n")
  path = ".\\Models\\banknote_scikit_model.sav"
  pickle.dump(model, open(path, "wb"))

  # 6. make a prediction 
  raw_inpt = np.array([[4.4, 1.8, -5.6, 3.2]],
    dtype=np.float32)
  norm_inpt = raw_inpt / 20

  print("Setting normalized inputs to:")
  for x in norm_inpt[0]:
    print("%0.3f " % x, end="")
  
  p = model.predict_proba(norm_inpt)  #  [[0.51 0.49]]
  p = p[0][1]  # first (only) row, second value

  print("\nPrediction prob = %0.6f " % p)
  if p < 0.5:
    print("Prediction = authentic")
  else:
    print("Prediction = forgery")

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
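For completeness, here is a sketch of how the pickled model could be used later. The round-trip is done in memory on made-up data so the example is self-contained; in practice you'd call pickle.load() on the saved .sav file instead of pickle.loads() on a byte blob.

```python
# sketch: save/load round-trip for a trained scikit model
# data here is made up; the demo uses a file in .\Models instead
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[ 0.18,  0.43, -0.14, -0.02],
              [-0.17, -0.22,  0.11,  0.05]], dtype=np.float32)
y = np.array([0, 1], dtype=np.int64)
model = LogisticRegression().fit(X, y)

blob = pickle.dumps(model)     # what pickle.dump() writes to disk
loaded = pickle.loads(blob)    # what pickle.load() reads back

probs = loaded.predict_proba(X[0:1])  # same API as before saving
print(probs)
```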