This morning I was working on a kernel logistic regression (KLR) problem. Regular logistic regression (LR) is perhaps the simplest form of machine learning (ML). It's used when the goal is to predict a binary value from two or more numeric predictor values. For example, you might want to predict if a person is Male (0) or Female (1), based on height, weight, and annual income.
The problem with regular LR is that it only works with data that is linearly separable — if you graph the data, you must be able to draw a straight line that more or less separates the two classes you're trying to predict.
Kernel logistic regression can handle non-linearly separable data. For example, the graph below might represent the predict-the-sex problem where there are just two input values, say, height and weight.
Well, anyway, in order to test my kernel logistic regression ML code, I needed some non-linearly separable data. I was about to start writing some C# code when quite by accident I came across a Python function named make_circles() that made the data shown in the graph above.
The code is simple:
# makeCircles.py
import numpy as np
from sklearn.datasets import make_circles

np.random.seed(4)
numPoints = 20
X, y = make_circles(n_samples=numPoints,
  factor=.3, noise=.05)
X = 10 * X  # scale from roughly [-1, +1] to roughly [-10, +10]
for i in range(0, numPoints):
  print(str(X[i][0]) + ", " +
    str(X[i][1]) + ", " +
    str(y[i]))
I ran the script as
(prompt) python makeCircles.py > nonSepData.txt
Then I opened the comma-delimited file in Excel, sorted the data on the 0-or-1 column, and made a graph. By adjusting the print() function I can control the exact form of the output.
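For instance, one way to get fixed-width output (a sketch, using the same script but with "%"-style formatting instead of str() concatenation) is to format each coordinate to four decimal places:

```python
# Variation of makeCircles.py: print each coordinate with four decimals.
import numpy as np
from sklearn.datasets import make_circles

np.random.seed(4)
numPoints = 20
X, y = make_circles(n_samples=numPoints, factor=.3, noise=.05)
X = 10 * X  # scale from roughly [-1, +1] to roughly [-10, +10]

for i in range(numPoints):
  # "%0.4f" fixes each coordinate at four decimal places
  print("%0.4f, %0.4f, %d" % (X[i][0], X[i][1], y[i]))
```

Each output line then looks like "3.1416, -2.7183, 1", which sorts and graphs cleanly in Excel.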
The downside of this technique is that it can only generate data with two dimensions.
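If you need more than two dimensions, the same concentric-circles idea generalizes by hand: sample random directions on the unit hypersphere, then scale half of them to an outer radius and half to an inner radius, with a bit of noise. Here's a minimal sketch (make_shells is my own hypothetical helper, not part of scikit-learn):

```python
# Hedged sketch: concentric "shells" in any dimension d, no sklearn needed.
# Same idea as make_circles(), generalized beyond two dimensions.
import numpy as np

def make_shells(n_samples, dim, factor=0.3, noise=0.05, seed=4):
  rng = np.random.RandomState(seed)
  half = n_samples // 2
  # random directions: normalize Gaussian vectors onto the unit hypersphere
  v = rng.normal(size=(n_samples, dim))
  v /= np.linalg.norm(v, axis=1, keepdims=True)
  # first half on the outer shell (radius 1.0), second half inner (radius = factor)
  radii = np.where(np.arange(n_samples) < half, 1.0, factor)
  X = v * radii[:, None] + noise * rng.normal(size=(n_samples, dim))
  y = (np.arange(n_samples) >= half).astype(int)  # 0 = outer, 1 = inner
  return X, y

X, y = make_shells(20, dim=3)
```

The two classes are not linearly separable in any dimension, which is exactly what you want for exercising a kernel method.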