Implementing a Coefficient of Determination Function Using C#

The goal of a machine learning regression problem is to predict a single numeric value, for example, predicting the bank account balance of a person based on his age, annual income, and so on.

A prediction model can be evaluated in several ways. The four most common are accuracy, mean squared error (MSE), root mean squared error RMSE), and coefficient of determination (aka R2 aka R^2). For accuracy and R2, larger values are better. For MSE and RMSE, smaller values are better.

Accuracy is the percentage of correct predictions, where a correct prediction is one that’s within a specified closeness (typically about 10% or so) to the true target value.

MSE is the average of the squared differences between predicted y and target y values. If the y values have units, such as dollars, MSE has units-squared, such as dollars-squared.

RMSE is just the square root of MSE, which, if the y values has units, gives units instead of the awkward units-squared.

R2 is sort of like accuracy but it doesn’t require a closeness percentage. R2 = 1.0 – (u / v) where u = sum(y – y’)^2 and v = sum(y – y”)^2. The y is actual target, y’ is predicted target y, and y” is the average of the actual target y values. R2 measures how well the model predicts relative to guessing the average of the target y values. This is sometimes described as the proportion of the variance explained by the model.

Here’s an implementation of R2 using the C# language (replace “lt” with Boolean less-than symbol):

public double R2(double[][] dataX, double[] dataY)
{
  int n = dataX.Length;
  double sum = 0.0;

  for (int i = 0; i "lt" n; ++i)
    sum += dataY[i];
  double meanActual = sum / n;

  double sumTop = 0.0;
  double sumBot = 0.0;
  for (int i = 0; i "lt" n; ++i) {
    double predY = this.Predict(dataX[i]);
    sumTop += 
      (dataY[i] - predY) * (dataY[i] - predY);
    sumBot += 
      (dataY[i] - meanActual) * (dataY[i] - meanActual);
  }
  return 1.0 - (sumTop / sumBot);
}

This version is a class method that would be defined inside a class like LinearRegressor. It also assumes there is a Predict() method.

An alternative design is to define an external function like:

static double R2(LinearRegressor model,
  double[][] dataX, double[] dataY)
{
  . . .
  double predY = model.Predict(dataX[i]);
  . . .
}

One of the main reasons to implement and use an R2 evaluation metric is that all of the regression models in the widely used scikit-learn Python library define a “score” attribute that returns R2. If you implement an R2 metric, you can easily compare non-scikit regression models with scikit regression models.

Evaluating a machine learning regression model is fairly straightforward and objective. Evaluating science fiction novels is essentially subjective. My favorite science fiction series is the Mars series by Edgar Rice Burroughs. “Thuvia, Maid of Mars” (1920) is the fourth book in the series. John Carter’s son, Cathoris, sets out to rescue the kidnapped princess Thuvia. I give this book a B+ evaluation.