“Quadratic Regression with SGD Training Using C#” in Visual Studio Magazine

I wrote an article titled “Quadratic Regression with SGD Training Using C#” in the January 2026 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2026/01/21/quadratic-regression-with-sgd-training-using-csharp.aspx.

The goal of a machine learning regression problem is to predict a single numeric value. For example, you might want to predict an employee’s salary based on age, height, high school grade point average, and so on. There are approximately a dozen common regression techniques. The most basic technique is called linear regression, or sometimes multiple linear regression, where the “multiple” indicates two or more predictor variables.

The form of a basic linear regression prediction model is y' = (w0 * x0) + (w1 * x1) + . . . + (wn * xn) + b, where y' is the predicted value, the xi are predictor values, the wi are weights, and b is the bias. Quadratic regression extends linear regression. The form of a quadratic regression model is y' = (w0 * x0) + . . . + (wn * xn) + (wj * x0 * x0) + . . . + (wk * x0 * x1) + . . . + b. There are derived predictors that are the square of each original predictor, and interaction terms that are the product of all possible pairs of original predictors.

Compared to basic linear regression, quadratic regression can handle more complex data. Compared to the most powerful regression techniques such as neural network regression, quadratic regression often has slightly worse prediction accuracy, but has much better model interpretability.

There are several ways to train a quadratic regression model, including stochastic gradient descent (SGD), pseudo-inverse training, closed-form matrix inverse training, L-BFGS optimization training, and so on. The demo program uses SGD training, which is iterative and requires a learning rate and a maximum number of epochs. These two parameter values must be determined by trial and error.
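
The key idea of SGD is to repeatedly visit the training items in random order and nudge each weight in the direction that reduces squared error. Here is a minimal C# sketch of the idea -- my own method and variable names, not the article's demo code -- where each row of trainX is assumed to already hold the expanded predictors (originals, squares, interactions):

static void TrainSGD(double[][] trainX, double[] trainY,
  double[] wts, ref double bias, double lrnRate,
  int maxEpochs)
{
  Random rnd = new Random(0);
  int n = trainX.Length;
  int[] indices = new int[n];
  for (int i = 0; i < n; ++i) indices[i] = i;

  for (int epoch = 0; epoch < maxEpochs; ++epoch)
  {
    Shuffle(indices, rnd);  // visit items in random order
    foreach (int ix in indices)
    {
      double[] x = trainX[ix];
      double yPred = bias;
      for (int j = 0; j < wts.Length; ++j)
        yPred += wts[j] * x[j];
      double err = yPred - trainY[ix];
      // gradient of 0.5 * err^2 wrt wts[j] is err * x[j]
      for (int j = 0; j < wts.Length; ++j)
        wts[j] -= lrnRate * err * x[j];
      bias -= lrnRate * err;
    }
  }
}

static void Shuffle(int[] arr, Random rnd)
{
  for (int i = arr.Length - 1; i > 0; --i)  // Fisher-Yates
  {
    int j = rnd.Next(0, i + 1);
    int tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
  }
}

A learning rate that is too large can cause training to overshoot and fail to converge; one that is too small makes training painfully slow, which is why the lrnRate and maxEpochs values must be found by trial and error.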

The output of the demo program is:

Begin C# quadratic regression with SGD training

Loading synthetic train (200) and test (40) data
Done

First three train X:
 -0.1660  0.4406 -0.9998 -0.3953 -0.7065
  0.0776 -0.1616  0.3704 -0.5911  0.7562
 -0.9452  0.3409 -0.1654  0.1174 -0.7192

First three train y:
  0.4840
  0.1568
  0.8054

Creating quadratic regression model

Setting lrnRate = 0.001
Setting maxEpochs = 1000

Starting SGD training
epoch =     0  MSE =   0.0957
epoch =   200  MSE =   0.0003
epoch =   400  MSE =   0.0003
epoch =   600  MSE =   0.0003
epoch =   800  MSE =   0.0003
Done

Model base weights:
 -0.2630  0.0354 -0.0420  0.0341 -0.1124

Model quadratic weights:
  0.0655  0.0194  0.0051  0.0047  0.0243

Model interaction weights:
  0.0043  0.0249  0.0071  0.1081 -0.0012 -0.0093
  0.0362  0.0085 -0.0568  0.0016

Model bias/intercept:   0.3220

Evaluating model
Accuracy train (within 0.10) = 0.8850
Accuracy test (within 0.10) = 0.9250

MSE train = 0.0003
MSE test = 0.0005

Predicting for x =
  -0.1660   0.4406  -0.9998  -0.3953  -0.7065

Predicted y = 0.4843

End demo

Suppose, as in the demo data, there are five predictors, aka features, (x0, x1, x2, x3, x4). The prediction equation for basic linear regression is:

y' = (w0 * x0) + (w1 * x1) + (w2 * x2) + (w3 * x3) + (w4 * x4) + b

The wi are model weights (aka coefficients), and b is the model bias (aka intercept). The values of the weights and the bias must be determined by training, so that predicted y’ values are close to the known, correct y values in a set of training data.
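
In code, computing a linear regression prediction is just a weighted sum of the inputs plus the bias. A minimal C# sketch, assuming the weights and bias have already been trained (hypothetical names, not the demo's code):

static double Predict(double[] x, double[] wts, double bias)
{
  double sum = bias;
  for (int i = 0; i < x.Length; ++i)
    sum += wts[i] * x[i];  // accumulate (wi * xi)
  return sum;  // the predicted y'
}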

Basic linear regression is simple, but it can't predict well for data that has an underlying non-linear structure, and it can't deal with data that has hidden interactions between the xi predictors.

The prediction equation for quadratic regression with five predictors is:

y' = (w0 * x0) + (w1 * x1) + (w2 * x2) + (w3 * x3) +
     (w4 * x4) +

     (w5 * x0*x0) + (w6 * x1*x1) + (w7 * x2*x2) +
     (w8 * x3*x3) + (w9 * x4*x4) +

     (w10 * x0*x1) + (w11 * x0*x2) + (w12 * x0*x3) +
     (w13 * x0*x4) + (w14 * x1*x2) + (w15 * x1*x3) +
     (w16 * x1*x4) + (w17 * x2*x3) + (w18 * x2*x4) +
     (w19 * x3*x4) + b

The squared (aka “quadratic”) xi^2 terms handle non-linear structure. If there are n predictors, there are also n squared terms. The xi * xj terms between all possible pairs of original predictors handle interactions between predictors. If there are n predictors, there are (n * (n-1)) / 2 interaction terms.
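
One way to implement quadratic regression is to expand each raw input vector into the full set of terms, then train and predict as if doing ordinary linear regression on the expanded data. A minimal C# sketch of the expansion -- a hypothetical helper, not necessarily how the demo program is organized:

static double[] Expand(double[] x)
{
  int n = x.Length;  // number of raw predictors
  double[] result = new double[n + n + (n * (n - 1)) / 2];
  int k = 0;
  for (int i = 0; i < n; ++i)
    result[k++] = x[i];            // original terms
  for (int i = 0; i < n; ++i)
    result[k++] = x[i] * x[i];     // squared terms
  for (int i = 0; i < n; ++i)      // all pairs (i, j), i < j
    for (int j = i + 1; j < n; ++j)
      result[k++] = x[i] * x[j];   // interaction terms
  return result;
}

For the demo's n = 5 predictors, this produces 5 + 5 + 10 = 20 expanded terms, matching the 5 base weights, 5 quadratic weights, and 10 interaction weights shown in the output above.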

Quadratic regression has a nice balance of prediction power and interpretability. The model weights/coefficients are easy to interpret. If the predictor values have been normalized to the same scale, larger weight magnitudes mean larger effect, and the sign of each weight indicates the direction of the effect.



Quadratic regression is a classical machine learning technique that still has a lot of appeal. Classical science fiction magazines often featured covers with giant insects. Here are three with giant ants. Left: “Amazing Stories”, Fall 1928. Center: “Thrilling Wonder Stories”, December 1938. Right: The German “Utopia” was a series of magazines / short novels, published every other week, from 1953 to 1968. This is #192 from September 15, 1959.

