A PyTorch Dataset Using the Pandas read_csv() Function

To train a PyTorch neural network, the most common approach is to read training data into a Dataset object, and then use a DataLoader object to serve the training data up in batches. When I implement a Dataset, I almost always use the NumPy loadtxt() function to read training data from a file into memory. But it’s possible to use the Pandas read_csv() function instead. Bottom line: the Pandas approach isn’t especially useful because the Pandas data frame has to be converted to a NumPy array anyway.

I used one of my standard examples to code up a demo of NumPy loadtxt() vs Pandas read_csv() functions. The goal is to predict political leaning (conservative = 0, moderate = 1, liberal = 2) from sex, age, state of residence, and income. The data looks like:

 1  0.24  1  0  0  0.2950  2
-1  0.39  0  0  1  0.5120  1
 1  0.63  0  1  0  0.7580  0
-1  0.36  1  0  0  0.4450  1
 1  0.27  0  1  0  0.2860  2
. . .

The columns are sex (M = -1, F = +1), age divided by 100, state of residence one-hot encoded across three columns (Michigan = 100, Nebraska = 010, Oklahoma = 001), income divided by $100,000, and political leaning (conservative = 0, moderate = 1, liberal = 2). The data is synthetic.
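
To make the encoding concrete, here is a minimal sketch that turns one raw record into the seven numeric values of a data line. The helper name encode_person() and the three lookup dictionaries are my own illustrative assumptions; they are not part of the demo program.

```python
# Hypothetical helper (not part of the original demo): encode one raw
# record using the scheme described above.
STATE_CODES = {
    "Michigan": [1, 0, 0],
    "Nebraska": [0, 1, 0],
    "Oklahoma": [0, 0, 1],
}
SEX_CODES = {"M": -1, "F": 1}
POLITICS_CODES = {"conservative": 0, "moderate": 1, "liberal": 2}


def encode_person(sex, age, state, income, politics):
    """Return the seven numeric values for one line of the data file."""
    row = [SEX_CODES[sex], age / 100.0]        # sex, normalized age
    row.extend(STATE_CODES[state])             # three one-hot state columns
    row.append(income / 100_000.0)             # normalized income
    row.append(POLITICS_CODES[politics])       # class label 0, 1, or 2
    return row


# A 24-year-old female from Michigan earning $29,500 who leans liberal
# corresponds to the first data line shown above.
print(encode_person("F", 24, "Michigan", 29_500.00, "liberal"))
```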

A standard NumPy loadtxt() version of a Dataset is:

import numpy as np
import pandas as pd  # not used in this version
import torch as T

device = T.device("cpu")  # assumed here; use "cuda" for a GPU

class PeopleDataset(T.utils.data.Dataset):
  def __init__(self, src_file):
    # numpy loadtxt() version
    all_xy = np.loadtxt(src_file, usecols=range(0,7),
      delimiter="\t", comments="#", dtype=np.float32)

    tmp_x = all_xy[:,0:6]   # cols [0,6) = [0,5]
    tmp_y = all_xy[:,6]     # 1-D

    self.x_data = T.tensor(tmp_x, 
      dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y,
      dtype=T.int64).to(device)  # 1-D

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgts = self.y_data[idx] 
    return preds, trgts  # as a Tuple
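
To show how a Dataset like this gets served up in batches, here is a minimal sketch that uses the five data lines shown earlier, held in memory rather than read from a file. The class name TinyPeopleDataset and the batch size of 2 are assumptions chosen for illustration, and everything stays on the CPU.

```python
import numpy as np
import torch as T

# The five data lines shown earlier, held in memory instead of a file.
all_xy = np.array([
    [ 1, 0.24, 1, 0, 0, 0.2950, 2],
    [-1, 0.39, 0, 0, 1, 0.5120, 1],
    [ 1, 0.63, 0, 1, 0, 0.7580, 0],
    [-1, 0.36, 1, 0, 0, 0.4450, 1],
    [ 1, 0.27, 0, 1, 0, 0.2860, 2],
], dtype=np.float32)


class TinyPeopleDataset(T.utils.data.Dataset):
    # Hypothetical in-memory stand-in for PeopleDataset (CPU only).
    def __init__(self, xy):
        self.x_data = T.tensor(xy[:, 0:6], dtype=T.float32)  # predictors
        self.y_data = T.tensor(xy[:, 6], dtype=T.int64)      # class labels

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, idx):
        return self.x_data[idx], self.y_data[idx]


ds = TinyPeopleDataset(all_xy)
loader = T.utils.data.DataLoader(ds, batch_size=2, shuffle=False)

# Collect and display the shape of each batch the loader serves up.
batch_shapes = []
for (x_batch, y_batch) in loader:
    batch_shapes.append((tuple(x_batch.shape), tuple(y_batch.shape)))
    print(x_batch.shape, y_batch.shape)
```

With five items and a batch size of 2, the last batch holds a single item because DataLoader keeps the leftover batch unless drop_last=True is passed.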

A version using the Pandas read_csv() function and the to_numpy() method is:

class PeopleDataset(T.utils.data.Dataset):
  def __init__(self, src_file):
    # pandas version
    xy_frame = pd.read_csv(src_file, usecols=range(0,7),
      delimiter="\t", comment="#", dtype=np.float32,
      header=None)  # no header row; keep the first data line
    all_xy = xy_frame.to_numpy()

    # as above
. . .
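
As a sanity check, the two readers can be compared on the same text. This is a minimal sketch that assumes a tab-delimited file with "#" comment lines and no header row; the sample text is built in memory with io.StringIO so that no file is needed. It also shows why header=None matters: without it, read_csv() treats the first data line as column names.

```python
import io
import numpy as np
import pandas as pd

# Tab-delimited sample text in the same layout as the demo data,
# with a comment line to exercise the comment handling.
text = (
    "# sex age state(3) income politics\n"
    "1\t0.24\t1\t0\t0\t0.2950\t2\n"
    "-1\t0.39\t0\t0\t1\t0.5120\t1\n"
    "1\t0.63\t0\t1\t0\t0.7580\t0\n"
)

via_numpy = np.loadtxt(io.StringIO(text), usecols=range(0, 7),
                       delimiter="\t", comments="#", dtype=np.float32)

# header=None tells Pandas the first data line is not a header row.
via_pandas = pd.read_csv(io.StringIO(text), usecols=range(0, 7),
                         delimiter="\t", comment="#", dtype=np.float32,
                         header=None).to_numpy()

# Without header=None the first data line is consumed as column names.
no_header_arg = pd.read_csv(io.StringIO(text), delimiter="\t", comment="#")

print(via_numpy.shape, via_pandas.shape, no_header_arg.shape)
print(np.allclose(via_numpy, via_pandas))
```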

Instead of calling the to_numpy() method, it’s possible to index the Pandas data frame directly using the iloc property and pass the result to np.array():

class PeopleDataset(T.utils.data.Dataset):
  def __init__(self, src_file):
    # pandas version
    xy_frame = pd.read_csv(src_file, usecols=range(0,7),
      delimiter="\t", comment="#", dtype=np.float32,
      header=None)  # no header row; keep the first data line
    all_xy = np.array(xy_frame.iloc[:,:])

    # as above
. . .
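
Because iloc indexes by integer position, it can also pull the predictor and target columns straight out of the data frame. This is a minimal sketch under the same assumptions as the demo data (tab-delimited, no header row), using in-memory sample text; it is a variation for illustration, not the code from the demo program.

```python
import io
import numpy as np
import pandas as pd

# Two sample lines in the same tab-delimited layout as the demo data.
text = (
    "1\t0.24\t1\t0\t0\t0.2950\t2\n"
    "-1\t0.39\t0\t0\t1\t0.5120\t1\n"
)
xy_frame = pd.read_csv(io.StringIO(text), delimiter="\t", comment="#",
                       dtype=np.float32, header=None)

# iloc uses integer positions, so these slices mirror the NumPy ones.
tmp_x = xy_frame.iloc[:, 0:6].to_numpy()   # predictors, shape (n, 6)
tmp_y = xy_frame.iloc[:, 6].to_numpy()     # targets, shape (n,)

# Same values as converting the whole frame first and slicing after.
all_xy = np.array(xy_frame.iloc[:, :])
print(tmp_x.shape, tmp_y.shape)
print(np.array_equal(all_xy[:, 0:6], tmp_x),
      np.array_equal(all_xy[:, 6], tmp_y))
```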

The rest of the program and the training and test data can be found at: https://jamesmccaffreyblog.com/2022/09/01/multi-class-classification-using-pytorch-1-12-1-on-windows-10-11/.

There’s no big moral to this story, just some fun mental exercise to stay in practice with PyTorch.

Two wonderful illustrations tagged as “amazingsurf” from fractal.batjorge.com. I don’t know the artist, but I’ll bet he does artistic exercises to stay in practice.