Logistic regression (LR) is one of the most fundamental machine learning techniques. LR is designed to do binary classification. A typical example is predicting whether a person is male (class 0) or female (class 1) based on predictor variables such as age, income, years of education, and so on.
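At its core, LR computes a weighted sum of the predictor values plus a bias, then squashes that sum through the logistic (sigmoid) function to get a probability of class 1. Here is a minimal sketch of the idea; the variable names and weight values are made up for illustration, not taken from the demo below.

```python
import numpy as np

def predict_prob(x, weights, bias):
    # weighted sum of predictors plus bias
    z = np.dot(x, weights) + bias
    # logistic (sigmoid) function maps any z to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# two predictors with made-up weights and bias
p = predict_prob(np.array([0.2, 0.3]), np.array([1.5, -2.0]), 0.4)
# p is the probability of class 1; predict class 1 when p >= 0.5
print(p)  # z = 0.1, so p is about 0.525
```

Training LR means finding the weights and bias values that best fit the data; scikit does that part for you.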
When I do logistic regression, I usually code an implementation from scratch because that gives me full control and lets me include only what I need and nothing extra.
Besides implementing LR from scratch, there are many machine learning code libraries available. One of the most popular libraries is scikit-learn (or just scikit for short). It’s a Python library. Because I use the Anaconda distribution of Python, I already had scikit installed so I thought I’d explore LR using scikit.
Here’s a graph of my demo data:
Notice that the data is not linearly separable, which means LR cannot make a good model. I did this on purpose to emphasize the point that LR is a simple technique that can't handle complex, non-linearly separable data. However, in realistic scenarios, you don't know whether your data is linearly separable until after you run logistic regression.
My demo data has just two predictor variables so that I can graph it, but LR can handle any number of predictors. Notice that there are a total of 21 training items: 12 class 0 items (57%) and 9 class 1 items (43%). Therefore, just by always predicting class 0, you would get 57% accuracy on the training data. That is exactly what happened, and it happened because the data is not linearly separable, not because scikit didn't work well.
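The 57% figure is just the majority-class baseline: count the items in each class and divide the larger count by the total. A quick check using the demo's class counts:

```python
import numpy as np

# 12 class 0 items and 9 class 1 items, as in the demo data
train_y = np.array([0]*12 + [1]*9)
counts = np.bincount(train_y)          # items per class: [12, 9]
baseline = counts.max() / len(train_y) # always predict the majority class
print(baseline)  # 12/21, about 0.5714
```

Any trained model should beat this baseline; a model that only matches it has learned nothing useful.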
Good fun. My demo program code is listed below.
# lr_skl_demo.py
# logistic regression with scikit demo
import numpy as np
from sklearn.linear_model import LogisticRegression
train_X = np.array([
  [0.2, 0.3], [0.1, 0.5], [0.2, 0.7],
  [0.3, 0.2], [0.3, 0.8], [0.4, 0.2],
  [0.4, 0.8], [0.5, 0.2], [0.5, 0.8],
  [0.6, 0.3], [0.7, 0.5], [0.6, 0.7],
  [0.3, 0.4], [0.3, 0.5], [0.3, 0.6],
  [0.4, 0.4], [0.4, 0.5], [0.4, 0.6],
  [0.5, 0.4], [0.5, 0.5], [0.5, 0.6]])
train_y = np.array([0,0,0, 0,0,0, 0,0,0, 0,0,0,
  1,1,1, 1,1,1, 1,1,1])
print("\nTraining data:")
print(train_X)
print("")
print(train_y)
print("\nTraining logistic regression model")
model = LogisticRegression(random_state=0).fit(train_X, train_y)
acc = model.score(train_X, train_y)
print("\nAccuracy of trained model = ")
print(acc)
pred = model.predict_proba(train_X[0:1, :]) # row 0, all cols
print("\nPrediction probs for first item:")
print(pred)
# unk = np.array([[0.15, 0.45]])
# pred = model.predict_proba(unk)
# print("\nPrediction probs for new (0.15, 0.45):")
# print(pred)
print("\nModel weights and bias:")
print(model.coef_)
print(model.intercept_)
# z = (model.coef_[0][0] * 0.2) + (model.coef_[0][1] * 0.3) + \
#   model.intercept_[0]
# p = 1.0 / (1.0 + np.exp(-z))
# print(p) # prob (0.2, 0.3) item is class 1
print("\nEnd LR demo \n")
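The commented-out lines near the end of the demo show how to recompute a probability by hand from the model's coef_ and intercept_ attributes. To show that this works, here is a self-contained sketch on tiny made-up data (not the demo data above) that compares the manual computation against predict_proba:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# tiny linearly separable toy data, made up for this check
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression(random_state=0).fit(X, y)

# recompute the class 1 probability for the first item by hand
z = np.dot(model.coef_[0], X[0]) + model.intercept_[0]
p1 = 1.0 / (1.0 + np.exp(-z))

# column 1 of predict_proba is the class 1 probability
print(p1, model.predict_proba(X[0:1])[0, 1])  # the two values match
```

Being able to reproduce the library's output from the raw weights is a good sanity check that you understand what the model is actually doing.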
Three images from an Internet search for “scratch”. I have no idea why these pictures came up, but they're interesting. I think the middle one is actress Angelina Jolie from the movie “Maleficent” (2014). I can't be sure if the image on the left is a real person or a mannequin. I'm pretty sure the guy on the right is not a software engineer. Thank you Internet.


