The Difference Between Linearly Separable Data and a Linear Classifier in Machine Learning

The terms “linearly separable data” and “linear classifier” often appear in the context of machine learning. They sound a lot alike, but they aren’t closely related in the context of ML, even though they are closely related in mathematics in general.

Here’s a basic bottom line: linearly separable data is simple data that can be classified using simple or complex ML techniques; a linear classifier is a category of many ML techniques that can be used for either simple or complex data.

There’s no well-defined relationship such as, “a linear classifier only works on linearly separable data” or “data that is not linearly separable can only be classified using a non-linear classifier.”

Before I move on, let me emphasize that in practice, you don’t need to know all the following details in order to apply ML effectively. The detailed distinctions and properties of “linearly separable data” and “linear classifier” are quite intricate, and they matter most in research, where terminology must be defined exactly.

OK, first, linearly separable data. Linearly separable data is data that, when graphed in two dimensions, can be separated by a straight line. Here’s an example:


This data is linearly separable because there is a line (actually many lines) from lower left to upper right that separates the red and blue classes.

You can imagine the data represents people, where the goal is to predict political affiliation, conservative (red) or liberal (blue), based on age (predictor x0) and income (predictor x1).
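To make the idea concrete, here’s a minimal Python sketch that checks whether a given line separates two classes. The (age, income) points and the candidate line x1 = x0 are made up purely for illustration:

```python
# Made-up data: each point is (age x0, income x1);
# label 0 = liberal (blue), label 1 = conservative (red).
data = [
    ((1.0, 4.0), 0), ((2.0, 5.0), 0), ((3.0, 7.0), 0),  # blue, above the line
    ((4.0, 2.0), 1), ((5.0, 3.0), 1), ((7.0, 4.0), 1),  # red, below the line
]

def separates(w0, w1, b, data):
    # True if the line w0*x0 + w1*x1 + b = 0 puts every class-1 point on
    # the positive side and every class-0 point on the negative side.
    for (x0, x1), label in data:
        side = w0 * x0 + w1 * x1 + b
        if label == 1 and side <= 0:
            return False
        if label == 0 and side >= 0:
            return False
    return True

print(separates(1.0, -1.0, 0.0, data))  # the line x1 = x0 -> True
```

Because many such lines exist for this data, `separates` would also return True for slightly rotated or shifted versions of the same line.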


Neither of these two datasets is linearly separable. The data on the left needs two straight lines. The data on the right needs a curved line.

Now, at this point, there are many additional details about linearly separable data that I won’t discuss: three or more predictor variables, three or more classes to predict, the kernel trick to convert to higher dimensionality, non-numeric predictor variables, and so on. The main point is that linearly separable data is simple, and therefore many simple ML techniques apply. These simple techniques include perceptrons, basic support vector machines, linear discriminant analysis, and others. But linearly separable data can also be classified using sophisticated techniques such as neural networks, k-nearest neighbors, and many others.
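As a sketch of one of those simple techniques, here’s a minimal perceptron trained on made-up linearly separable points. The data, learning rate, and epoch limit are all arbitrary choices for illustration; labels are -1/+1, as is conventional for perceptrons:

```python
# Made-up linearly separable data: (x0, x1) points with labels -1 or +1.
data = [((1.0, 4.0), -1), ((2.0, 5.0), -1), ((3.0, 7.0), -1),
        ((4.0, 2.0), +1), ((5.0, 3.0), +1), ((7.0, 4.0), +1)]

w0, w1, b = 0.0, 0.0, 0.0  # weights and bias, all start at zero
lr = 0.1                   # learning rate

for epoch in range(100):
    errors = 0
    for (x0, x1), y in data:
        pred = 1 if w0 * x0 + w1 * x1 + b > 0 else -1
        if pred != y:              # perceptron rule: update only on a mistake
            w0 += lr * y * x0
            w1 += lr * y * x1
            b += lr * y
            errors += 1
    if errors == 0:                # converged: every point classified correctly
        break

print(all((1 if w0 * x0 + w1 * x1 + b > 0 else -1) == y
          for (x0, x1), y in data))  # True
```

The perceptron convergence theorem guarantees this loop terminates with zero errors whenever the data really is linearly separable; on non-separable data it would cycle forever, which is why the epoch cap is there.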

OK, now what about linear classifiers? A linear classifier is an ML technique that uses a mathematical linear combination of the predictor variables. A good example of a linear classifier is logistic regression (LR). One possible form of the prediction equation for LR is p = 1.0 / (1.0 + exp(-z)) where z = a + w0*x0 + w1*x1. If p is less than 0.5 the prediction is class 0; if p is greater than 0.5 the prediction is class 1.
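That prediction equation translates directly into code. The weight values (a, w0, w1) and the input points below are hypothetical, chosen just for illustration:

```python
import math

def lr_predict(x0, x1, a, w0, w1):
    z = a + w0 * x0 + w1 * x1           # the linear combination
    p = 1.0 / (1.0 + math.exp(-z))      # logistic sigmoid squashes z into (0, 1)
    return 1 if p > 0.5 else 0          # class 1 if p > 0.5, else class 0

# hypothetical trained weights
print(lr_predict(2.0, 5.0, a=-0.5, w0=0.3, w1=-0.2))  # z = -0.9, p < 0.5 -> 0
print(lr_predict(5.0, 3.0, a=-0.5, w0=0.3, w1=-0.2))  # z = +0.4, p > 0.5 -> 1
```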

The characteristic that makes this a linear classifier is the z value, which is a mathematical linear combination of the predictor variables (x0, x1) and some weights (w0, w1). There are only additions and multiplications — no squares, square roots, sines, etc.

Now notice that the LR prediction equation itself is not linear because it uses the exp() function. The key thing that makes a linear classifier a linear classifier is that a linear combination of the predictor variables occurs somewhere in the prediction equation. Examples of ML classification techniques that are not linear include neural networks, k-nearest neighbors, and decision trees.

Linearly separable data is always simple. But linear classifiers can be simple or complex.

OK, I could write many more pages, but the discussion above has hit the two main points. Before I stop, though, I have to mention one of hundreds of exceptions and quirks. As it turns out, logistic regression (which is a medium-complexity linear classifier) has trouble dealing with linearly separable data: when the training data can be separated perfectly, the weights have no optimal values and grow without bound during training, because any separating line can always be made slightly more “confident.” However, this trouble is mostly theoretical, and in practice (with regularization or early stopping) logistic regression works fine for linearly separable data.
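Here is a tiny sketch of that quirk, using two perfectly separable one-dimensional points and plain gradient descent on the log loss. The learning rate and epoch counts are arbitrary; the point is that more training always produces a larger weight, because no finite optimum exists:

```python
import math

# Two perfectly separable 1-D points: x = -1 is class 0, x = +1 is class 1.
data = [(-1.0, 0), (1.0, 1)]

def train(epochs, lr=0.5):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(b + w * x)))  # LR prediction
            w -= lr * (p - y) * x                     # gradient of the log loss
            b -= lr * (p - y)
    return w

# The weight keeps growing: a steeper sigmoid is always slightly "more
# confident", so gradient descent never settles.
print(train(100) < train(10000))  # True
```

With an L2 penalty added to the loss, the weight would instead converge to a finite value, which is one reason regularized logistic regression is the practical default.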



Precisely defined terminology is important in technical fields in order to prevent miscommunication. But bogus terminology is bad. The Six Sigma fad of the late 1990s and early 2000s was hilarious — hundreds of ridiculous terms, acronyms, and things like “green belts” to establish a cult of insiders. A company I worked for at one time bought into Six Sigma — and I knew it was time for me to bail out. Any company where management couldn’t see the absurdity of Six Sigma . . . and yes, the company collapsed a couple of years after going Six Sigma.
