Ordinal Regression Using a Neural Network

Ordinal regression is a cross between regression and classification.

A regression problem is one where the goal is to predict a single numeric value. For example, predicting a person’s annual income from age, sex, occupation, and region. A classification problem is one where the goal is to predict a single categorical value. For example, predicting a person’s region (“east”, “west”, “central”) from age, sex, income, and occupation.

A hybrid type of problem is called ordinal regression (also called ordinal classification). In ordinal regression the problem is to predict a categorical value where the possible values have an ordering. For example, predicting an employee’s job performance (“poor”, “average”, “good”) based on age, sex, education level, and pay rate. The idea here is that it makes sense to say “poor” is-less-than “average” is-less-than “good”.

Ordinal regression is surprisingly tricky and dozens of techniques have been proposed. The technique I usually use is a clever variation of standard neural network classification. Suppose you are trying to predict “poor”, “average”, “good”, “excellent”. In a standard neural network classifier, the target values are one-hot encoded like poor = (1, 0, 0, 0), average = (0, 1, 0, 0), good = (0, 0, 1, 0), excellent = (0, 0, 0, 1). The network is trained with softmax activation to output a probability vector such as (0.30, 0.40, 0.10, 0.20), which would indicate a prediction of average with probability = 0.40.
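The standard classification setup described above can be sketched in a few lines of NumPy. This is just the encoding and prediction bookkeeping, not a trained network; the probability vector is the example from the text:

```python
import numpy as np

classes = ["poor", "average", "good", "excellent"]

# one-hot targets used to train a standard neural classifier
one_hot = {
    "poor":      np.array([1, 0, 0, 0]),
    "average":   np.array([0, 1, 0, 0]),
    "good":      np.array([0, 0, 1, 0]),
    "excellent": np.array([0, 0, 0, 1]),
}

def softmax(z):
    # numerically stable softmax: shift logits before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

# suppose the trained network's softmax output layer produced this vector
probs = np.array([0.30, 0.40, 0.10, 0.20])
pred = classes[int(np.argmax(probs))]
print(pred)  # "average" -- the class with the largest probability, 0.40
```

Prediction is just argmax over the softmax probabilities, which is why this encoding throws away any ordering information among the classes.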

You could use a standard neural network classifier on an ordinal problem but that approach doesn’t take advantage of the implied information you have about ordering. The clever approach I like was published by researcher J. Cheng in 2007 but doesn’t have a standard name, so I’ll call it the Cheng technique. You encode the target values like poor = (1, 0, 0, 0), average = (1, 1, 0, 0), good = (1, 1, 1, 0), excellent = (1, 1, 1, 1). Instead of softmax activation, you use logistic sigmoid activation to output a vector where each value is an individual probability such as (0.50, 0.80, 0.40, 0.60).
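The Cheng-style cumulative encoding can be sketched as follows. The encoding function matches the targets given above; the decoding rule (count the sigmoid outputs above 0.5) is my assumption for illustration, since the post doesn't spell out how a prediction vector is converted back to a class label:

```python
import numpy as np

classes = ["poor", "average", "good", "excellent"]

def cheng_encode(class_idx, num_classes=4):
    # cumulative target: the first (class_idx + 1) slots are 1, the rest 0
    # poor -> (1,0,0,0), average -> (1,1,0,0),
    # good -> (1,1,1,0), excellent -> (1,1,1,1)
    t = np.zeros(num_classes)
    t[: class_idx + 1] = 1.0
    return t

def decode(sigmoid_outputs, threshold=0.5):
    # heuristic decoding: count outputs above the threshold, minus 1,
    # gives a 0-based class index; the outputs are not guaranteed to
    # decrease monotonically, so this is only one plausible rule
    k = int(np.sum(np.asarray(sigmoid_outputs) > threshold)) - 1
    return classes[max(k, 0)]

pred = decode([0.9, 0.8, 0.3, 0.1])
print(pred)  # "average" -- two outputs exceed 0.5
```

Because each output unit has its own logistic sigmoid, the network is trained with a per-element binary loss rather than a single softmax cross-entropy over the whole vector.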

The technique is subtle and is based on several ideas that are used in other ordinal regression techniques. The research paper is “A Neural Network Approach to Ordinal Regression” (2007). The technique isn’t perfect. Although the technique works well in practice, the technique does not “ensure the monotonic decrease of the outputs of the neural network.”

Ordinal regression is somewhat related to ranking. In a ranking problem you assign a rank to each of a set of items. In ordinal classification you predict a single output value.


Judging a beauty pageant is sort of an ordinal regression problem. In some international pageants the women and girls wear national costumes as part of the judging. Left: Elaborate and beautiful costume worn by Miss Philippines. I rate it “excellent”. Center: Miss Japan wearing some sort of weird comic book costume. I rate it as “that’s-Japan”. Right: Miss Thailand in a horse-style costume (or maybe it’s a cow). The rear end of the horse is perhaps Miss Assyria. I rate it as “what-the-heck”.

This entry was posted in Machine Learning.

2 Responses to Ordinal Regression Using a Neural Network

  1. Thorsten Kleppe says:

    Thank you for sharing this topic. I’ve never tried this idea, so it’s on my list.

    But how should we deal with the predictions? Wouldn’t the ordinal regression predictions overwhelm the poor class unless we reject outputs like (0, 0, 0, 0) or (0, 0, 0, 1)? A prediction like (1, 0, 1, 1) or (1, 0, 0, 1) or (1, 0, 1, 0) would probably also be classified as (1, 0, 0, 0) for poor. Or will that never happen?

    Logistic sigmoid and softmax are so closely related.
    The approach you showed here with sigmoid:
    poor = (1, 0, 0, 0), average = (1, 1, 0, 0), good = (1, 1, 1, 0), excellent = (1, 1, 1, 1)

    Looks pretty similar to a possible softmax approach:
    poor = (0.25, 0, 0, 0), average = (0, 0.5, 0, 0), good = (0, 0, 0.75, 0), excellent = (0, 0, 0, 1)
    If the target value for each class is changed as follows:
    target = 0.25 for poor, target = 0.5 for average, target for good = 0.75 and target for excellent = 1.0

    In this case, however, the excellent class would be overwhelming, which seems strange to my intuition, as if the distribution were swapped.

  2. Hi Thorsten, Your question and comments are exactly what I thought when I first encountered this technique. My blog post left out a few key details — I had to read through Cheng’s paper several times to fully understand how and why his technique works. Briefly, prediction is normal — you get a vector of probabilities and then pick the largest probability. The tricky part is understanding why the encoding scheme works — it’s very subtle. JM

Comments are closed.