I was listening to a research lecture last week. The speaker was careful to distinguish between the terms error, loss, risk, and likelihood. The differences are subtle. And the terms are often used interchangeably.
I’ll explain with an example, but first let me emphasize that vocabulary is just vocabulary. In any meaningful context, it’s important to define exactly what each term means.
Briefly, error is the difference between a single actual value and a single predicted value. Loss is the average error over training data. Risk is the average error over all data. Likelihood is the probability of getting a particular set of results across training data.
Suppose you have a problem where you want to predict whether a person living in the U.S. will develop heart disease within the next 12 months. The population is all people living in the U.S. You collect a set of data for 50 people with known results and divide it into 40 people for training an ML model and 10 people for testing the accuracy of the model.
You use the training data to create a math equation that accepts predictor values like age, sex, blood pressure, and so on. The equation emits a value between 0.0 and 1.0 where a value less than 0.5 means no heart disease and a value greater than 0.5 means heart disease.
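To make this concrete, here’s a minimal sketch of such a prediction equation, logistic regression style. The weights and bias values are entirely hypothetical, just for illustration; a real model would learn them from the 40 training items.

```python
import math

def predict(x, weights, bias):
    # weighted sum of predictor values, squashed to (0.0, 1.0) by the
    # logistic sigmoid function
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical weights for (age, sex, blood pressure) -- illustrative only
w = [0.03, 0.25, 0.02]
b = -4.0

p = predict([55, 1, 130], w, b)   # some value between 0.0 and 1.0
print("heart disease" if p >= 0.5 else "no heart disease")
```

The output value p is then thresholded at 0.5 to produce the final yes/no prediction.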
Error often, but not always, means the difference between a single predicted value and the associated actual value. For example, suppose the actual value is 1 (person will develop heart disease) and the predicted value is 0.74; then the error is 1 – 0.74 = 0.26. However, error is also commonly used in at least two other ways. First, error can mean something like squared error for a single item: (1 – 0.74)^2 = 0.26^2 = 0.0676. Second, error can also mean the average squared error (or cross entropy error) across all 40 training items.
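The single-item arithmetic above is trivial but worth seeing in code, using the same actual and predicted values from the example:

```python
actual = 1.0       # person will develop heart disease
predicted = 0.74   # model output

error = actual - predicted    # raw error for one item
squared_error = error ** 2    # squared error for one item

print(error, squared_error)   # 0.26 and 0.0676 (up to float rounding)
```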
Loss is basically synonymous with error but usually means average error across all items in the training data. The loss function must be established before training because minimizing the loss function, typically mean squared error or mean cross entropy error, determines how the training algorithm works.
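A quick sketch of the two loss functions mentioned, computed over a tiny made-up set of three training items (the actual and predicted values here are illustrative, not from a real model):

```python
import math

def mse_loss(actuals, preds):
    # mean squared error: average of per-item squared errors
    return sum((a - p) ** 2 for a, p in zip(actuals, preds)) / len(actuals)

def ce_loss(actuals, preds):
    # mean binary cross entropy error
    return -sum(a * math.log(p) + (1.0 - a) * math.log(1.0 - p)
                for a, p in zip(actuals, preds)) / len(actuals)

actuals = [0.0, 1.0, 1.0]
preds = [0.22, 0.78, 0.59]

print(mse_loss(actuals, preds))   # 0.0883
print(ce_loss(actuals, preds))
```

During training, the algorithm adjusts the model’s weights to make one of these values as small as possible.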

Click to enlarge. Derivation of the logistic regression update rule from maximum likelihood estimation. Beautiful. From xxxx://cs229.stanford.edu/notes/cs229-notes1.pdf (use “http”).
Risk usually means average error across the entire population. Because you can never calculate this, risk is more of a research topic than an engineering topic. The term empirical (“based on observation”) risk minimization is kind of a vocabulary hack that means minimizing average error across the training data — exactly the same as loss.
Likelihood usually means the probability of getting a set of predicted values. Suppose your 40 actual values on the training data are (0, 1, 1, …) and the predicted values are (0.22, 0.78, 0.59, …). To train a model, instead of using an algorithm that minimizes average loss/error, you can use an algorithm that maximizes the probability of getting the desired output set. Weirdly, in many cases the algorithm used to maximize likelihood turns out to be the exact same as the algorithm used to minimize average loss/error.
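Here’s a sketch of computing likelihood for the first three items of that example. In practice the log-likelihood is used instead, both to avoid numeric underflow when multiplying many small probabilities and because maximizing log-likelihood is equivalent to minimizing mean cross entropy error, which is why the two training approaches often coincide:

```python
import math

actuals = [0, 1, 1]
preds = [0.22, 0.78, 0.59]

# likelihood: probability of observing exactly these actual outcomes,
# given the model's predicted probabilities
likelihood = 1.0
for a, p in zip(actuals, preds):
    likelihood *= p if a == 1 else (1.0 - p)

# log-likelihood: sum of logs instead of a product of probabilities
log_likelihood = sum(math.log(p if a == 1 else 1.0 - p)
                     for a, p in zip(actuals, preds))

print(likelihood, log_likelihood)
```

Note that log_likelihood here is exactly the negative of the (unaveraged) cross entropy error for the same data, which makes the connection between the two training approaches explicit.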
The bottom line is that the differences in ML terminology can be subtle and usually aren’t worth worrying about when you’re trying to solve a problem. The differences are worth worrying about when you’re writing a research paper and have to define everything exactly.
