I was exploring positive and unlabeled learning (PUL). PUL has two phases. In the first phase, you iteratively create and train a binary classification model many times. It makes sense to use a classification model that is simple and quick, and logistic regression meets those criteria. I took a look at creating a logistic regression model using PyTorch and the results were OK. But I figured I’d take a look at using the scikit library. The bottom line: the scikit library is convenient, but it doesn’t have the flexibility I need for most of my problem scenarios.
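To make the first PUL phase concrete, here is a minimal sketch of one common variant: repeatedly train a classifier on the known positive items versus a random sample of unlabeled items treated as pseudo-negatives, then average each unlabeled item's predicted probability of being positive. The synthetic data, sample sizes, and iteration count are all made up for illustration.

```python
# Sketch of PUL phase 1: iteratively train on positives vs. a random
# sample of unlabeled items, accumulating predicted probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
pos_x = rng.normal(loc=1.0, size=(40, 4))   # known positive items
unl_x = rng.normal(loc=0.0, size=(100, 4))  # unlabeled items

n_iters = 10
probs = np.zeros(len(unl_x))
for _ in range(n_iters):
  idx = rng.choice(len(unl_x), size=40, replace=False)  # pseudo-negatives
  train_x = np.vstack([pos_x, unl_x[idx]])
  train_y = np.array([1] * 40 + [0] * 40)
  model = LogisticRegression().fit(train_x, train_y)
  probs += model.predict_proba(unl_x)[:, 1]
probs /= n_iters  # avg. probability each unlabeled item is positive
```

Because the classifier is retrained many times, a simple, fast model like logistic regression keeps the loop cheap.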
I was surprised at how fast the scikit LogisticRegression model trains. The scikit library uses L-BFGS optimization by default. A neural network approach to logistic regression typically uses some form of stochastic gradient descent, which relies only on first-order derivative (gradient) information. L-BFGS also uses approximate second-order derivative information, which lets it converge in far fewer iterations. The downside to L-BFGS is that it requires all training data to be in memory, so it isn’t well-suited for very large training datasets.
I used the Banknote Authentication data. It has 1372 data items. Each item represents a digital image of a banknote (think euro or dollar bill). There are four predictor values followed by a 0 (authentic) or a 1 (forgery).
I fetched the raw data from archive.ics.uci.edu/ml/datasets/banknote+authentication. I added ID numbers from 1 to 1372 (not necessary — just to track items). Then I randomly split the data into a 1097-item set for training and a 275-item set for testing. I divided all four predictor values by 20 to normalize them so that they’d be between -1.0 and +1.0.
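The preprocessing described above can be sketched as follows. The stand-in predictor values are synthetic; the real raw predictors fall roughly in the range covered here, which is why dividing by 20 maps them into [-1.0, +1.0].

```python
# Sketch of the divide-by-20 normalization and the random
# 1097 / 275 train-test split. Predictor values are stand-ins.
import numpy as np

rng = np.random.default_rng(1)
raw = rng.uniform(-17.0, 17.0, size=(1372, 4))  # stand-in predictors
norm = raw / 20.0                               # now within [-1.0, +1.0]

idx = rng.permutation(len(norm))  # shuffle, then split
train = norm[idx[:1097]]
test = norm[idx[1097:]]
```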
In my demo, I loaded data using the Pandas library’s read_csv() function, which seems to be the norm for scikit users. I could have used the NumPy loadtxt() function instead.
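For comparison, here is what the loadtxt() alternative looks like. A two-line in-memory sample stands in for the tab-separated data file; the usecols argument skips the leading ID column, just as in the demo program.

```python
# Loading with NumPy loadtxt instead of pandas read_csv.
import io
import numpy as np

sample = ("0001\t0.22\t-0.11\t0.05\t0.18\t0\n"
          "0002\t-0.30\t0.41\t-0.27\t0.09\t1\n")
data = np.loadtxt(io.StringIO(sample),
  delimiter="\t", usecols=(1, 2, 3, 4, 5), dtype=np.float64)
train_x = data[:, 0:4]                 # four predictor columns
train_y = data[:, 4].astype(np.int64)  # class labels
```

The main practical difference is that loadtxt() yields a NumPy array directly, while read_csv() yields a DataFrame; scikit models accept either.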
Logistic regression is the epitome of classical machine learning and has a mathematical beauty. Logistic regression is simple but there are many different ways to implement it programmatically. The scikit library is good for simple tasks but isn’t good if you need a custom model of some sort.
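One of those many possible implementations is a from-scratch version with plain gradient descent, which gives the kind of full control the scikit library doesn't. This is a minimal sketch; the synthetic data and hyperparameters are illustrative.

```python
# Logistic regression from scratch using plain gradient descent.
import numpy as np

def train_logreg(x, y, lr=0.1, epochs=500):
  w = np.zeros(x.shape[1]); b = 0.0
  for _ in range(epochs):
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # sigmoid outputs
    grad_w = x.T @ (p - y) / len(y)         # gradient of log loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w; b -= lr * grad_b
  return w, b

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 4))
y = (x[:, 0] + x[:, 1] > 0).astype(np.float64)  # synthetic labels
w, b = train_logreg(x, y)
preds = (1.0 / (1.0 + np.exp(-(x @ w + b))) > 0.5).astype(np.float64)
acc = np.mean(preds == y)
```

With a custom loop like this you can change the loss, add regularization terms, or wire the model into an iterative PUL scheme, none of which is easy with the fixed scikit API.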
Cleopatra (69 BC – 30 BC) is the personification of classical beauty. Many people are surprised to learn that Cleopatra wasn’t ethnically Egyptian. She was a member of the Greek Ptolemaic dynasty, founded by Ptolemy I, one of the generals of Alexander the Great, after Alexander conquered Egypt in 332 BC. There are many possible interpretations of what she looked like. These three images are likely close. The center image was recovered from the ruins of the city of Herculaneum, which was destroyed by the eruption of Mount Vesuvius in 79 AD (along with the nearby city of Pompeii).
# banknote_scikit.py

import pandas as pd
import numpy as np
import pickle
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# ----------------------------------------------------------
# archive.ics.uci.edu/ml/datasets/banknote+authentication
# IDs 0001 to 1372 added as first column
# data has been k=20 normalized (all four predictor columns)
# ID  variance  skewness  kurtosis  entropy  class
# [0]    [1]       [2]       [3]      [4]     [5]
# (0 = authentic, 1 = forgery)
# train: 1097 items, test: 275 items
# ----------------------------------------------------------

def main():
  # 0. get started
  print("\nBanknote using scikit logistic regression \n")
  np.random.seed(1)

  # 1. load data
  print("\nLoading Banknote data into dataframes ")
  train_file = ".\\Data\\banknote_k20_train.txt"
  test_file = ".\\Data\\banknote_k20_test.txt"
  train_df = pd.read_csv(train_file,
    sep="\t", usecols=[1,2,3,4,5], header=None)
  train_x = train_df.iloc[:, [0,1,2,3]]
  train_y = train_df.iloc[:, 4]  # 1-D required
  test_df = pd.read_csv(test_file,
    sep="\t", usecols=[1,2,3,4,5], header=None)
  test_x = test_df.iloc[:, [0,1,2,3]]
  test_y = test_df.iloc[:, 4]  # 1-D required

  # 2. create model
  print("\nCreating logistic regression model ")
  model = LogisticRegression()

  # 3. train model
  print("\nStarting training ")
  model.fit(train_x, train_y)
  print("Done ")

  # 4. evaluate model
  train_preds = model.predict(train_x)
  acc_train = metrics.accuracy_score(train_y, train_preds)
  print("\nAccuracy on train data = %0.2f%%" % \
    (acc_train * 100))
  test_preds = model.predict(test_x)
  acc_test = metrics.accuracy_score(test_y, test_preds)
  print("Accuracy on test data = %0.2f%%" % \
    (acc_test * 100))

  # 5. save model
  print("\nSaving trained logistic regression model \n")
  path = ".\\Models\\banknote_scikit_model.sav"
  pickle.dump(model, open(path, "wb"))

  # 6. make a prediction
  raw_inpt = np.array([[4.4, 1.8, -5.6, 3.2]],
    dtype=np.float32)
  norm_inpt = raw_inpt / 20
  print("Setting normalized inputs to:")
  for x in norm_inpt[0]:
    print("%0.3f " % x, end="")
  p = model.predict_proba(norm_inpt)  # [[0.51 0.49]]
  p = p[0][1]  # first (only) row, second value
  print("\nPrediction prob = %0.6f " % p)
  if p < 0.5:
    print("Prediction = authentic")
  else:
    print("Prediction = forgery")

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
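The model saved by pickle in step 5 can be reloaded later for inference. A brief sketch, using a tiny throwaway model and a temporary file so the snippet is self-contained; in the demo the path would be the saved .\Models\banknote_scikit_model.sav file.

```python
# Reloading a pickled scikit model and verifying it predicts
# the same probabilities as the original.
import os
import pickle
import tempfile
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.2, -0.3]])
y = np.array([0, 1])
model = LogisticRegression().fit(x, y)  # throwaway stand-in model

path = os.path.join(tempfile.mkdtemp(), "model.sav")
with open(path, "wb") as f:
  pickle.dump(model, f)
with open(path, "rb") as f:
  loaded = pickle.load(f)
probs = loaded.predict_proba(x)  # matches the original model
```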

