I Get Tricked By the Pollen Dataset

In a nutshell: I found a dataset, “529_pollen”, that I thought would make a nice challenge for a regression model (one that predicts a single numeric value). But after a few hours of thrashing around, I discovered that the dataset is basically just random values, and so I had wasted my time.

It all started when I was looking for a dataset to try some new ideas for a regression prediction model. After searching the Internet for a bit, I came across the “Pollen Dataset”, aka “529_pollen”, at http://www.openml.org/d/529. I gave the data description only a quick glance (this is where I went wrong — I should have looked more closely) and dove in.

The dataset was generated synthetically and has 3,848 rows. Each row has six values and looks like this:

-2.3482,  3.6314,  5.0289, 10.8721, -1.3852, 1
-1.1520,  1.4805,  3.2375, -0.5939,  2.1235, 2
-2.5245, -6.8633, -2.8037,  8.4631, -3.4126, 3
. . .

The first four values on each row are the predictors: “ridge”, “nub”, “crack”, “weight”. The fifth value is “density”, the value to predict. The sixth value is a 1-based ID.

I randomly split the data into a 2,886-item set for training (75%) and a 962-item set for testing (25%). To establish baselines, I ran the data through a scikit LinearRegression model and a scikit GradientBoostingRegressor model. Both models scored about 25% accuracy on the training and test data (a prediction counts as correct if it's within 20% of the true target density value).
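
Here's a minimal sketch of that kind of baseline check (not my exact code; the file names match the demo program below, and the pct_accuracy() helper is just my reconstruction of the within-20% accuracy metric):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

def pct_accuracy(y_pred, y_true, pct_close=0.20):
  # fraction of predictions within pct_close of true value
  return np.mean(np.abs(y_pred - y_true) < \
    np.abs(pct_close * y_true))

train = np.loadtxt(".\\Data\\pollen_train.txt", delimiter=",")
test = np.loadtxt(".\\Data\\pollen_test.txt", delimiter=",")
x_train, y_train = train[:,0:4], train[:,4]  # skip ID col
x_test, y_test = test[:,0:4], test[:,4]

for model in (LinearRegression(), GradientBoostingRegressor()):
  model.fit(x_train, y_train)
  acc = pct_accuracy(model.predict(x_test), y_test)
  print("%s acc = %0.4f" % (type(model).__name__, acc))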

“That’s odd”, I thought. I expected poor accuracy from linear regression but much higher accuracy from gradient boosting regression. But I confidently created a PyTorch regression model and got . . . about 25% accuracy.

I spent a couple of frustrating hours trying to fine-tune my PyTorch model, but nothing seemed to work.

Only at this point did I take a closer look at the data description. To make a long story short, the data is essentially just random noise and there’s no reason to believe that any machine learning technique can score high accuracy. Arg.

Lesson learned: every machine learning prediction project should begin with understanding the data.
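
As a concrete example of that lesson, here's a quick diagnostic sketch (my hindsight code, not something from the dataset documentation) that checks the Pearson correlation between each predictor and the target. For data that is essentially random noise, the correlations come out near zero, which is consistent with every model landing at about 25% accuracy:

import numpy as np

data = np.loadtxt(".\\Data\\pollen_train.txt", delimiter=",")
x = data[:,0:4]  # ridge, nub, crack, weight
y = data[:,4]    # density

for j, name in enumerate(["ridge", "nub", "crack", "weight"]):
  r = np.corrcoef(x[:,j], y)[0,1]  # Pearson r
  print("%-6s  r = %+0.4f" % (name, r))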



I don’t enjoy getting tricked by synthetic datasets. But I love magic tricks, especially those that rely on a “gimmick” — a physical device of some sort (as opposed to tricks that rely on sleight-of-hand).

Here’s an example of a homemade Card Vanish gimmick. The magician has a frame with three horizontal windows. He puts a playing card into the frame, and after the frame is covered for an instant, the card vanishes. The trick depends on a gimmicked card with two horizontal windows cut into it. When the gimmicked card slides down inside the frame, the card's windows line up with the frame's openings, the remaining strips of the card are hidden behind the frame bars, and the card appears to vanish.


Demo program.

# pollen.py
# data from https://www.openml.org/d/529
# predict density from ridge, nub, crack, weight
# PyTorch 2.1.2-CPU Anaconda3-2023.09-1  Python 3.11.5

import numpy as np
import torch as T

device = T.device('cpu') 

# -----------------------------------------------------------

class PollenDataset(T.utils.data.Dataset):
  def __init__(self, src_file):
    # ridge, nub, crack, weight, density, ID
    # -2.3482, 3.6314, 5.0289,  10.8721, -1.3852, 1
    # -1.1520, 1.4805, 3.2375, -0.5939,   2.1235, 2

    # double-read technique. ignore ID column
    tmp_x = np.loadtxt(src_file, usecols=[0,1,2,3],
      delimiter=",", comments="#", dtype=np.float32)
    tmp_y = np.loadtxt(src_file, usecols=4, delimiter=",",
      comments="#", dtype=np.float32)
    tmp_y = tmp_y.reshape(-1,1)  # 2D required

    # single-read approach
    # tmp_xy = np.loadtxt(src_file, usecols=[0,1,2,3,4],
    #   delimiter=",", comments="#", dtype=np.float32)
    # tmp_x = tmp_xy[:,[0,1,2,3]]
    # tmp_y = tmp_xy[:,[4]]  # already 2D

    # normalize by divide by 100.0
    tmp_x /= 100.0
    tmp_y /= 100.0

    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgt = self.y_data[idx] 
    return (preds, trgt)  # as a tuple

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 100)  # 4-(100-100)-1
    self.hid2 = T.nn.Linear(100, 100)
    self.oupt = T.nn.Linear(100, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = self.oupt(z)  # regression: no activation
    return z

# -----------------------------------------------------------

def accuracy(model, ds, pct_close):
  # assumes model.eval()
  # correct within pct_close of true density
  n_correct = 0; n_wrong = 0

  for i in range(len(ds)):
    X = ds[i][0]   # inputs; shape [4]
    Y = ds[i][1]   # target; shape [1]
    with T.no_grad():
      oupt = model(X)   # computed densities

    if T.abs(oupt - Y) < T.abs(pct_close * Y):
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def train(model, ds, bs, lr, me, le):
  # dataset, bat_size, lrn_rate, max_epochs, log interval
  train_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)
  loss_func = T.nn.MSELoss()
  optimizer = T.optim.Adam(model.parameters(), lr=lr)

  for epoch in range(0, me):
    epoch_loss = 0.0  # for one full epoch
    for (b_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # predictors
      y = batch[1]  # target density
      optimizer.zero_grad()
      oupt = model(X)
      loss_val = loss_func(oupt, y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()  # compute gradients
      optimizer.step()     # update weights

    if epoch % le == 0:
      print("epoch = %4d  |  loss = %0.4f" % \
        (epoch, epoch_loss)) 

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin Pollen predict density ")
  T.manual_seed(0)
  np.random.seed(0)
  
  # 1. create Dataset objects
  print("\nCreating Pollen Dataset objects ")
  train_file = ".\\Data\\pollen_train.txt"
  train_ds = PollenDataset(train_file)  # 2886 rows

  test_file = ".\\Data\\pollen_test.txt"
  test_ds = PollenDataset(test_file)  # 962 rows

  # 2. create network
  print("\nCreating 4-(100-100)-1 neural network ")
  net = Net().to(device)

# -----------------------------------------------------------

  # 3. train model
  print("\nbat_size = 10 ")
  print("loss = MSELoss() ")
  print("optimizer = Adam ")
  print("lrn_rate = 0.001 ")

  print("\nStarting training")
  net.train()
  train(net, train_ds, bs=10, lr=0.001, me=50, le=5)
  print("Done ")

# -----------------------------------------------------------

  # 4. evaluate model accuracy
  print("\nComputing model accuracy (within 0.20 of true) ")
  net.eval()
  acc_train = accuracy(net, train_ds, 0.20)  # item-by-item
  print("Accuracy on train data = %0.4f" % acc_train)

  acc_test = accuracy(net, test_ds, 0.20)  # item-by-item
  print("Accuracy on test data = %0.4f" % acc_test)

# -----------------------------------------------------------

  # 5. make a prediction
  print("\nPredicting for 2.7650, -0.0854, 9.6972, -0.5078: ")
  # actual y = 0.3527
  x = np.array([[2.7650, -0.0854, 9.6972, -0.5078]],
    dtype=np.float32)
  x /= 100.0  # normalize
  x = T.tensor(x, dtype=T.float32).to(device) 

  with T.no_grad():
    pred_y = net(x)
  pred_y = pred_y.item()  # scalar
  print("%0.4f" % (pred_y * 100.0))  # de-normalized

# -----------------------------------------------------------

  # 6. save model (state_dict approach)
  # print("\nSaving trained model state")
  # fn = ".\\Models\\pollen_density_model.pt"
  # T.save(net.state_dict(), fn)

  # model = Net()
  # model.load_state_dict(T.load(fn))
  # use model to make prediction(s)

  print("\nEnd Pollen density demo ")

if __name__ == "__main__":
  main()
