In a nutshell: I found a dataset, “529_pollen”, that I thought would make a nice challenge for a regression model (one that predicts a single numeric value). But after a few hours of thrashing around, I discovered that the dataset is basically just random values, and so I had wasted my time.
It all started when I was looking for a dataset to try some new ideas for a regression prediction model. After searching the Internet for a bit, I came across the “Pollen Dataset”, aka “529_pollen”, at http://www.openml.org/d/529. I gave the data description only a quick glance (this is where I went wrong — I should have looked more closely) and dove in.
The dataset was generated synthetically, and has 3,848 rows. Each row has six values and looks like this:
-2.3482,  3.6314,  5.0289, 10.8721, -1.3852, 1
-1.1520,  1.4805,  3.2375, -0.5939,  2.1235, 2
-2.5245, -6.8633, -2.8037,  8.4631, -3.4126, 3
. . .
The first four values on each row are the predictors: “ridge”, “nub”, “crack”, “weight”. The fifth value is “density”, the value to predict. The sixth value is a 1-based ID.
I randomly split the data into a 2,886-item set for training (75%) and a 962-item set for testing (25%). To get ready, I ran the data through a scikit LinearRegression model and a scikit GradientBoostingRegressor model. Both models scored about 25% accuracy on the training and test data, where a prediction counts as correct if it is within 20% of the true target y value.
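The scikit baseline can be sketched like this. Because the dataset file isn't reproduced here, the sketch substitutes random stand-in data of the same shape (which, as it turned out, is not far from the truth); the `acc_within()` helper mirrors the within-20% correctness criterion:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

def acc_within(model, X, y, pct_close=0.20):
    # fraction of predictions within pct_close of the true value
    pred = model.predict(X)
    return float(np.mean(np.abs(pred - y) < np.abs(pct_close * y)))

# stand-in data: random predictors and an unrelated random target
# (the real file is at openml.org/d/529 and isn't loaded here)
rng = np.random.default_rng(0)
X = rng.normal(size=(3848, 4))
y = rng.normal(size=3848)

X_trn, X_tst, y_trn, y_tst = train_test_split(
    X, y, test_size=0.25, random_state=0)  # 2,886 / 962 split

for mdl in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
    mdl.fit(X_trn, y_trn)
    print(type(mdl).__name__, "test acc = %0.2f" %
          acc_within(mdl, X_tst, y_tst))
```

With real data, a large gap between the linear and gradient-boosting scores usually signals learnable non-linear structure; near-identical poor scores from both, as happened here, is an early warning sign.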
“That’s odd”, I thought. I expected a poor accuracy using linear regression but a much higher accuracy using gradient boosting regression. But I confidently created a PyTorch regression model and got . . . about 25% accuracy.
I spent a couple of frustrating hours trying to fine-tune my PyTorch model, but nothing seemed to work.
Only at this point did I take a closer look at the data description. To make a long story short, the data is essentially just random noise and there’s no reason to believe that any machine learning technique can score high accuracy. Arg.
Lesson learned: machine learning prediction projects always begin with understanding the data.
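A quick sanity check that would have caught the problem early is to compute the correlation between each predictor and the target before doing any modeling. A minimal sketch, again using random stand-in data since the real file isn't reproduced here (the `pollen_all.txt` filename in the comment is hypothetical):

```python
import numpy as np

# stand-in for the real file; with the actual data you would use
# something like: xy = np.loadtxt("pollen_all.txt", delimiter=",")
rng = np.random.default_rng(1)
xy = rng.normal(size=(3848, 5))  # 4 predictors + density

names = ["ridge", "nub", "crack", "weight"]
for j, name in enumerate(names):
    r = np.corrcoef(xy[:, j], xy[:, 4])[0, 1]
    print("%-6s vs density: r = %+0.3f" % (name, r))
```

Near-zero r values don't rule out a purely non-linear relationship, but combined with a flat gradient-boosting score they are strong evidence that the target is just noise.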

I don’t enjoy getting tricked by synthetic datasets. But I love magic tricks, especially those that rely on a “gimmick” — a physical device of some sort (as opposed to tricks that rely on sleight-of-hand).
Here’s an example of a homemade Card Vanish gimmick. The magician has a frame with three horizontal windows. He puts a playing card into the frame, and after the frame is covered for an instant, the card vanishes. The trick depends on a gimmicked card with two horizontal windows. When the gimmicked card slides down the frame, its windows line up with the frame’s windows, the remaining card material is hidden behind the frame bars, and the card appears to vanish.
The complete demo program is presented below.
# pollen.py
# data from https://www.openml.org/d/529
# predict density from ridge, nub, crack, weight
# PyTorch 2.1.2-CPU Anaconda3-2023.09-1 Python 3.11.5
import numpy as np
import torch as T
device = T.device('cpu')
# -----------------------------------------------------------
class PollenDataset(T.utils.data.Dataset):
  def __init__(self, src_file):
    # ridge, nub, crack, weight, density, ID
    # -2.3482, 3.6314, 5.0289, 10.8721, -1.3852, 1
    # -1.1520, 1.4805, 3.2375, -0.5939, 2.1235, 2
    # double-read technique; ignore the ID column
    tmp_x = np.loadtxt(src_file, usecols=[0,1,2,3],
      delimiter=",", comments="#", dtype=np.float32)
    tmp_y = np.loadtxt(src_file, usecols=4, delimiter=",",
      comments="#", dtype=np.float32)
    tmp_y = tmp_y.reshape(-1,1)  # 2D required

    # single-read approach:
    # tmp_xy = np.loadtxt(src_file, usecols=[0,1,2,3,4],
    #   delimiter=",", comments="#", dtype=np.float32)
    # tmp_x = tmp_xy[:,[0,1,2,3]]
    # tmp_y = tmp_xy[:,[4]]  # already 2D

    # normalize by dividing by 100.0
    tmp_x /= 100.0
    tmp_y /= 100.0

    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y, dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgt = self.y_data[idx]
    return (preds, trgt)  # as a tuple
# -----------------------------------------------------------
class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 100)  # 4-(100-100)-1
    self.hid2 = T.nn.Linear(100, 100)
    self.oupt = T.nn.Linear(100, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = self.oupt(z)  # regression: no activation
    return z
# -----------------------------------------------------------
def accuracy(model, ds, pct_close):
  # assumes model.eval()
  # correct if within pct_close of the true density
  n_correct = 0; n_wrong = 0
  for i in range(len(ds)):
    X = ds[i][0]  # 2-d predictors
    Y = ds[i][1]  # 2-d target
    with T.no_grad():
      oupt = model(X)  # computed density
    if T.abs(oupt - Y) < T.abs(pct_close * Y):
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc
# -----------------------------------------------------------
def train(model, ds, bs, lr, me, le):
  # dataset, bat_size, lrn_rate, max_epochs, log_every
  train_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)
  loss_func = T.nn.MSELoss()
  optimizer = T.optim.Adam(model.parameters(), lr=lr)

  for epoch in range(0, me):
    epoch_loss = 0.0  # for one full epoch
    for (b_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # predictors
      y = batch[1]  # target density
      optimizer.zero_grad()
      oupt = model(X)
      loss_val = loss_func(oupt, y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()  # compute gradients
      optimizer.step()  # update weights
    if epoch % le == 0:
      print("epoch = %4d | loss = %0.4f" % \
        (epoch, epoch_loss))
# -----------------------------------------------------------
def main():
  # 0. get started
  print("\nBegin Pollen predict density ")
  T.manual_seed(0)
  np.random.seed(0)

  # 1. create Dataset objects
  print("\nCreating Pollen Dataset objects ")
  train_file = ".\\Data\\pollen_train.txt"
  train_ds = PollenDataset(train_file)  # 2886 rows
  test_file = ".\\Data\\pollen_test.txt"
  test_ds = PollenDataset(test_file)  # 962 rows

  # 2. create network
  print("\nCreating 4-(100-100)-1 neural network ")
  net = Net().to(device)

  # 3. train model
  print("\nbat_size = 10 ")
  print("loss = MSELoss() ")
  print("optimizer = Adam ")
  print("lrn_rate = 0.001 ")
  print("\nStarting training")
  net.train()
  train(net, train_ds, bs=10, lr=0.001, me=50, le=5)
  print("Done ")

  # 4. evaluate model accuracy
  print("\nComputing model accuracy (within 0.20 of true) ")
  net.eval()
  acc_train = accuracy(net, train_ds, 0.20)  # item-by-item
  print("Accuracy on train data = %0.4f" % acc_train)
  acc_test = accuracy(net, test_ds, 0.20)  # item-by-item
  print("Accuracy on test data = %0.4f" % acc_test)

  # 5. make a prediction
  print("\nPredicting for 2.7650, -0.0854, 9.6972, -0.5078: ")
  # actual y = 0.3527
  x = np.array([[2.7650, -0.0854, 9.6972, -0.5078]],
    dtype=np.float32)
  x /= 100.0  # normalize, same as training data
  x = T.tensor(x, dtype=T.float32).to(device)
  with T.no_grad():
    pred_y = net(x)
  pred_y = pred_y.item()  # scalar
  print("%0.4f" % (pred_y * 100.0))  # de-normalized

  # 6. save model (state_dict approach)
  # print("\nSaving trained model state")
  # fn = ".\\Models\\pollen_density_model.pt"
  # T.save(net.state_dict(), fn)
  # model = Net()
  # model.load_state_dict(T.load(fn))
  # then use model to make prediction(s)

  print("\nEnd Pollen density demo ")

if __name__ == "__main__":
  main()

[Figure captions: the first image shows the default parallel coordinates plot based on the density of the Pollen dataset; the second image highlights the borders; the last image shows the default data distribution.]