The scikit Diabetes Dataset is Essentially Useless – Good Predictions Not Possible

Bottom line: The title of this post is unintentional click-bait. The default target-to-predict diabetes score in column [10] of the scikit Diabetes Dataset cannot be predicted meaningfully. But the variables in columns [4], [5], [6], [7], and [8] can be meaningfully predicted from the other columns.

I have worked with machine learning for decades. The scikit-learn library was first released in 2010, and it contains several small datasets for use in examples. One of the most popular of these is the Diabetes Dataset, intended for regression models.

I was aware of the Diabetes Dataset but never looked at it until recently. I was more than just a little bit surprised to discover that the dataset is essentially useless — the target y values just cannot be predicted with any meaningful accuracy.

But the dataset is useful if a different variable is specified as the target y to predict.

The Diabetes Dataset looks like:

59, 2, 32.1, 101.00, 157,  93.2, 38, 4.00, 4.8598, 87, 151
48, 1, 21.6,  87.00, 183, 103.2, 70, 3.00, 3.8918, 69,  75
72, 2, 30.5,  93.00, 156,  93.6, 41, 4.00, 4.6728, 85, 141
. . . 

The dataset has 442 items. Each item represents a patient and has 10 predictor values followed by a target value to predict. The 10 predictor variables are age in column [0], sex [1], body mass index [2], blood pressure [3], serum cholesterol [4], low-density lipoproteins [5], high-density lipoproteins [6], total cholesterol / HDL ratio [7], log of serum triglycerides [8], and blood sugar [9]. According to the documentation, the target value in the last column [10] is a quantitative measure of diabetes disease progression one year after baseline.
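The raw columns correspond to scikit-learn's short feature names, which you can check directly:

```python
from sklearn.datasets import load_diabetes

# s1..s6 are the six blood serum measurements in columns [4] through [9]
print(load_diabetes().feature_names)
# ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
```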

Note: The sex encoding isn’t explained, but I suspect male = 1, female = 2 because there are 235 1-values and 206 2-values.

I ran the Diabetes Dataset through six models: 1.) linear regression, 2.) k-nearest neighbors regression, 3.) kernel ridge regression, 4.) neural network (MLP) regression, 5.) random forest regression, and 6.) gradient boosting regression. None of the models gave any meaningful prediction results for the diabetes score. The output of my experiment:

Begin Diabetes Dataset exploration

Loading diabetes train (342), test (100) data
Done

First three X predictors:
[ 59.0000   2.0000  32.1000 . . . 87.0000]
[ 48.0000   1.0000  21.6000 . . . 69.0000]
[ 72.0000   2.0000  30.5000 . . . 85.0000]

First three y targets:
151.0000
 75.0000
141.0000

==========
Linear Regression

acc train (0.10) = 0.1813
acc test (0.10) = 0.2800
==========

==========
Nearest Neighbors k=4

acc train (0.10) = 0.1988
acc test (0.10) = 0.1400
==========

==========
Kernel Ridge Regression RBF gamma=0.005, alpha=0.01

acc train (0.10) = 0.9971
acc test (0.10) = 0.0700
==========

==========
MLPRegressor max_iter=2000

acc train (0.10) = 0.1520
acc test (0.10) = 0.2200
==========

==========
Random Forest Regression depth=5

acc train (0.10) = 0.2368
acc test (0.10) = 0.1700
==========

==========
Gradient Boost Regression depth=5, lrn_rate=0.10

acc train (0.10) = 0.7807
acc test (0.10) = 0.1800
==========

The models either severely underfit (linear regression, nearest neighbors, neural network, random forest), or severely overfit (kernel ridge regression, gradient boosting regression). For the accuracy values, a prediction is scored as correct if it is within 10% of the true target value.
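The 10% closeness criterion can be restated as a small vectorized check (a sketch; the name within_pct is mine, not from the demo program below):

```python
import numpy as np

def within_pct(y_true, y_pred, pct=0.10):
  # fraction of predictions within pct of the corresponding true value
  y_true = np.asarray(y_true, dtype=np.float64)
  y_pred = np.asarray(y_pred, dtype=np.float64)
  return np.mean(np.abs(y_true - y_pred) < np.abs(y_true) * pct)
```

For example, predictions (105.0, 250.0) against true values (100.0, 200.0) give an accuracy of 0.5 — the first is within 10%, the second is not.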

Note: I re-encoded the sex predictor from 1,2 to 0,1. I normalized the other predictor values using divide-by-constant normalization (100, 1, 100, 1000, 1000, 1000, 100, 10, 10, 1000, 1000). I normalized the target y values by dividing by 1000. I used the first 342 data items for training and the remaining 100 items for testing. I experimented with other splitting ideas and normalization techniques, but none had any significant effect.
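The re-encoding and divide-by-constant normalization described in the note can be sketched like this, using the first data row as the example (the divisors are the constants listed above, with the last one applied to the target):

```python
import numpy as np

# first row of the raw dataset: 10 predictors followed by the target
row = np.array([59, 2, 32.1, 101.0, 157, 93.2, 38, 4.0,
                4.8598, 87, 151.0])
row[1] -= 1.0  # re-encode sex from 1,2 to 0,1

divisors = np.array([100, 1, 100, 1000, 1000, 1000, 100,
                     10, 10, 1000, 1000], dtype=np.float64)
normed = row / divisors  # e.g., age 59 -> 0.59, target 151 -> 0.151
```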

So, the scikit Diabetes Dataset is an unintentional hoax. The default diabetes score in the dataset cannot be meaningfully predicted.

However, after a bit of experimentation, I discovered that columns [4], [5], [6], [7], and [8] can be predicted meaningfully. If you use the built-in load_diabetes() function with a return_X_y=True parameter, column [10] is automatically the target to predict. But you can load the data as a DataFrame and then specify a different column as the target. Alternatively, you can use preprocessed training data.

Here’s partial output when column [4] (serum cholesterol) is set as the target-to-predict:

==========
Linear Regression

acc train (0.10) = 0.9942
acc test (0.10) = 0.9800
==========

==========
Nearest Neighbors k=4

acc train (0.10) = 0.8860
acc test (0.10) = 0.7900
==========

==========
Random Forest Regression depth=5

acc train (0.10) = 0.9766
acc test (0.10) = 0.9400
==========

So, the title of this post is somewhat click-bait. The scikit Diabetes Dataset is useful for experimenting with regression models, but only when the default diabetes score in column [10] is not used as the target y.





Demo program:

# diabetes_scikit.py
# various techniques for the Diabetes Dataset

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import neighbors
from sklearn.kernel_ridge import KernelRidge
from sklearn.neural_network import MLPRegressor
from sklearn import ensemble

np.set_printoptions(precision=4, suppress=True,
  floatmode='fixed', linewidth=120)

# -----------------------------------------------------------

def accuracy(model, data_X, data_y, pct_close):
  n = len(data_X)
  n_correct = 0; n_wrong = 0
  for i in range(n):
    x = data_X[i].reshape(1,-1)
    y = data_y[i]
    y_pred = model.predict(x)[0]  # predict() returns an array
    if np.abs(y - y_pred) < np.abs(y * pct_close):
      n_correct += 1
    else:
      n_wrong += 1
  # print("Correct = " + str(n_correct))
  # print("Wrong   = " + str(n_wrong))
  return n_correct / (n_correct + n_wrong)

# -----------------------------------------------------------

def MSE(model, data_X, data_y):
  n = len(data_X)
  total = 0.0  # avoid shadowing the built-in sum()
  for i in range(n):
    x = data_X[i].reshape(1,-1)
    y = data_y[i]
    y_pred = model.predict(x)[0]
    total += (y - y_pred) * (y - y_pred)
  return total / n

# -----------------------------------------------------------

print("\nBegin Diabetes Dataset exploration ")

# use pre-normalized data
print("\nLoading diabetes train (342), test (100) data ")

cols_X = [0,1,2,3,4,5,6,7,8,9]  # predictors
col_y = 10  # target. cols 4 5 6 7 8 are much better
train_file = ".\\Data\\diabetes_train_342.txt"  # assumed file name
train_X = np.loadtxt(train_file, comments="#",
  usecols=cols_X,
  delimiter=",",  dtype=np.float64)
train_y = np.loadtxt(train_file, comments="#", usecols=col_y,
  delimiter=",",  dtype=np.float64)

test_file = ".\\Data\\diabetes_test_100.txt"
test_X = np.loadtxt(test_file, comments="#",
  usecols=cols_X,
  delimiter=",",  dtype=np.float64)
test_y = np.loadtxt(test_file, comments="#", usecols=col_y,
  delimiter=",",  dtype=np.float64)
print("Done ")

# use built-in diabetes data for
#   alternative normalization and split.
# from sklearn.datasets import load_diabetes
# from sklearn.model_selection import train_test_split
# X, y = load_diabetes(return_X_y=True, scaled=True)
# train_X, test_X, train_y, test_y = \
#   train_test_split(X, y, random_state=0)  # 25% test

print("\nFirst three X predictors: ")
for i in range(3):
  print(train_X[i])
print("\nFirst three y targets: ")
for i in range(3):
  print("%0.4f" % train_y[i])

print("\n========== ")
print("Linear Regression ")
model = LinearRegression()
model.fit(train_X, train_y)
acc_train = accuracy(model, train_X, train_y, 0.10)
print("\nacc train (0.10) = %0.4f " % acc_train)
acc_test = accuracy(model, test_X, test_y, 0.10)
print("acc test (0.10) = %0.4f " % acc_test)
print("========== ")

print("\n========== ")
print("Nearest Neighbors k=4")
model = neighbors.KNeighborsRegressor(4)
model.fit(train_X, train_y)
acc_train = accuracy(model, train_X, train_y, 0.10)
print("\nacc train (0.10) = %0.4f " % acc_train)
acc_test = accuracy(model, test_X, test_y, 0.10)
print("acc test (0.10) = %0.4f " % acc_test)
print("========== ")

print("\n========== ")
print("Kernel Ridge Regression RBF gamma=0.005, alpha=0.01")
model = KernelRidge(kernel='rbf', gamma=0.005, alpha=0.01)
model.fit(train_X, train_y)
acc_train = accuracy(model, train_X, train_y, 0.10)
print("\nacc train (0.10) = %0.4f " % acc_train)
acc_test = accuracy(model, test_X, test_y, 0.10)
print("acc test (0.10) = %0.4f " % acc_test)
print("========== ")

print("\n========== ")
print("MLPRegressor max_iter=2000")
model = MLPRegressor(random_state=0, max_iter=2000)
model.fit(train_X, train_y)
acc_train = accuracy(model, train_X, train_y, 0.10)
print("\nacc train (0.10) = %0.4f " % acc_train)
acc_test = accuracy(model, test_X, test_y, 0.10)
print("acc test (0.10) = %0.4f " % acc_test)
print("========== ")

print("\n========== ")
print("Random Forest Regression depth=5")
params = { "n_estimators": 100, "max_depth": 5 }
model = ensemble.RandomForestRegressor(**params)
model.fit(train_X, train_y)
acc_train = accuracy(model, train_X, train_y, 0.10)
print("\nacc train (0.10) = %0.4f " % acc_train)
acc_test = accuracy(model, test_X, test_y, 0.10)
print("acc test (0.10) = %0.4f " % acc_test)
print("========== ")

print("\n========== ")
print("Gradient Boost Regression depth=5, lrn_rate=0.10")
params = {
"n_estimators": 100, "max_depth": 5, "learning_rate": 0.10,
}
model = ensemble.GradientBoostingRegressor(**params)
model.fit(train_X, train_y)
acc_train = accuracy(model, train_X, train_y, 0.10)
print("\nacc train (0.10) = %0.4f " % acc_train)
acc_test = accuracy(model, test_X, test_y, 0.10)
print("acc test (0.10) = %0.4f " % acc_test)
print("========== ")