“Random Forest Regression and Bagging Regression Using C#” in Visual Studio Magazine

I wrote an article titled “Random Forest Regression and Bagging Regression Using C#” in the January 2025 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/Articles/2025/01/02/Random-Forest-Regression-and-Bagging-Regression-Using-CSharp.aspx.

A machine learning random forest regression system predicts a single numeric value. A random forest is an ensemble (collection) of simple decision tree regressors that have been trained on different random subsets of the source training data. To make a prediction for an input vector x, each tree makes a prediction and the final predicted y value is the average of the predicted values computed by the individual trees.
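The averaging step is simple enough to sketch in a few lines. This is a minimal illustration in Python, not the C# code from the article. The `forest_predict` name and the toy "trees" (plain functions that return fixed values) are hypothetical stand-ins for trained decision tree regressors.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a trained tree: maps an input vector to one number.
Tree = Callable[[Sequence[float]], float]

def forest_predict(trees: Sequence[Tree], x: Sequence[float]) -> float:
    """Return the mean of the individual tree predictions for input x."""
    if not trees:
        raise ValueError("forest has no trees")
    return sum(tree(x) for tree in trees) / len(trees)

# Three toy "trees" that ignore x and return made-up values.
toy_trees = [lambda x: 0.40, lambda x: 0.50, lambda x: 0.60]
x = [-0.1660, 0.4406, -0.9998, -0.3953, -0.7065]
y_hat = forest_predict(toy_trees, x)
print(f"Predicted y = {y_hat:.4f}")
```

In a real forest each element of `trees` would be a decision tree trained on its own data subset, but the final averaging would look much the same.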

The training data subsets used by random forest consist of only some of the rows of the source training data (with some rows possibly duplicated), and some or all of the columns.

A bagging (“bootstrap aggregation”) regression system is a specific type of random forest system where only some of the rows but all of the columns/predictors of the source training data are used to construct the training data subsets.
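The difference between the two kinds of subsets can be shown with a short Python sketch (again, not the article's C# code). The `make_subset` function is a hypothetical helper. It assumes rows are drawn uniformly with replacement, which is what allows duplicates, and that columns are drawn without replacement.

```python
import random

def make_subset(n_source_rows, n_source_cols, n_rows, n_cols, rng):
    """Return (row indices, column indices) that define one training subset."""
    # Rows: sampled with replacement, so a source row can appear more than once.
    rows = [rng.randrange(n_source_rows) for _ in range(n_rows)]
    # Columns: sampled without replacement (an assumption of this sketch).
    cols = sorted(rng.sample(range(n_source_cols), n_cols))
    return rows, cols

rng = random.Random(0)

# Random forest style subset: only some of the columns (3 of 5 here).
rf_rows, rf_cols = make_subset(200, 5, 200, 3, rng)

# Bagging style subset: all of the columns.
bag_rows, bag_cols = make_subset(200, 5, 200, 5, rng)

print("random forest columns:", rf_cols)
print("bagging columns:      ", bag_cols)
print("distinct rows in bagging subset:", len(set(bag_rows)))
```

Each tree in the ensemble would get its own call to something like `make_subset` and then be trained only on the selected rows and columns.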

My article presents a complete demo implemented from scratch in C#, with no external dependencies. The synthetic data looks like:

-0.1660,  0.4406, -0.9998, -0.3953, -0.7065,  0.4840
 0.0776, -0.1616,  0.3704, -0.5911,  0.7562,  0.1568
-0.9452,  0.3409, -0.1654,  0.1174, -0.7192,  0.8054
 0.9365, -0.3732,  0.3846,  0.7528,  0.7892,  0.1345
. . .

The first five values on each line are the x predictors. The last value on each line is the target y variable to predict. There are 200 training items and 40 test items.
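To make the data layout concrete, here is a small Python sketch that splits one comma-delimited line into its five predictors and its target. The `parse_line` helper is hypothetical and is not the loader used by the C# demo.

```python
def parse_line(line, n_predictors=5):
    """Split a comma-delimited line into (x predictors, y target)."""
    vals = [float(tok) for tok in line.split(",")]  # float() ignores padding spaces
    if len(vals) != n_predictors + 1:
        raise ValueError(f"expected {n_predictors + 1} values, got {len(vals)}")
    return vals[:n_predictors], vals[n_predictors]

# The first two lines of the synthetic data shown above.
lines = [
    "-0.1660,  0.4406, -0.9998, -0.3953, -0.7065,  0.4840",
    " 0.0776, -0.1616,  0.3704, -0.5911,  0.7562,  0.1568",
]

for line in lines:
    x, y = parse_line(line)
    print("x =", x, " y =", y)
```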

The output of the demo program is:

Setting numTrees = 100
Setting maxDepth = 6
Setting minSamples = 2
Setting nRows = 200
Setting nCols = 5
Creating and training RandomForestRegression model
Done

Evaluating model
Accuracy train (within 0.15) = 0.9250
Accuracy test (within 0.15) = 0.7250
Predicting for x =
  -0.1660   0.4406  -0.9998  -0.3953  -0.7065
Predicted y = 0.4828
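The output doesn't spell out what "within 0.15" means. The Python sketch below assumes a prediction counts as correct when it is within 15 percent of the true target value. An absolute tolerance of 0.15 would be a different, equally plausible reading. The `accuracy` function and its `pct_close` parameter are hypothetical names, and two of the three predictions are made up for illustration.

```python
def accuracy(preds, targets, pct_close=0.15):
    """Fraction of predictions within pct_close (relative) of the true target."""
    if len(preds) != len(targets) or len(targets) == 0:
        raise ValueError("preds and targets must be non-empty and the same length")
    n_correct = 0
    for p, t in zip(preds, targets):
        if abs(p - t) < abs(pct_close * t):  # assumed meaning of "within 0.15"
            n_correct += 1
    return n_correct / len(targets)

# First pair is from the demo (predicted 0.4828, target 0.4840).
# The other two predictions are invented for this example.
preds = [0.4828, 0.2000, 0.8000]
targets = [0.4840, 0.1568, 0.8054]
print(f"Accuracy (within 0.15) = {accuracy(preds, targets):.4f}")
```

Whatever the exact definition, the reported values correspond to 185 of the 200 training items and 29 of the 40 test items being scored as correct.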

The motivation for combining many simple decision tree regressors into a forest is the fact that a simple decision tree will always overfit training data if the tree is deep enough. A deep enough decision tree will predict its training data perfectly (except for very unusual data scenarios), but is likely to predict poorly on new, previously unseen data. By using a collection of trees that have been trained on different subsets of the source data, the averaged prediction of the collection is much less likely to overfit.

The demo source training data has 200 rows and nRows is set to 200, so the data subsets have the same number of rows as the original source training data. This isn’t required and nRows can be smaller or larger than the number of source rows. As a very general rule of thumb, nRows is often a value between one-half and twice the number of source rows.
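One consequence of sampling rows with replacement is that a subset with as many rows as the source still leaves out a good share of the source rows. The Python sketch below computes the expected fraction of distinct source rows in a subset and checks it with a quick simulation. It assumes each row is drawn uniformly at random with replacement, and both function names are hypothetical.

```python
import random

def expected_unique_fraction(n_source, n_rows):
    """Expected fraction of source rows that appear at least once in n_rows draws."""
    # A given row is missed by one draw with probability (1 - 1/n_source).
    return 1.0 - (1.0 - 1.0 / n_source) ** n_rows

def simulated_unique_fraction(n_source, n_rows, n_trials, rng):
    """Estimate the same fraction by repeated sampling with replacement."""
    total_distinct = 0
    for _ in range(n_trials):
        total_distinct += len({rng.randrange(n_source) for _ in range(n_rows)})
    return total_distinct / (n_trials * n_source)

rng = random.Random(1)
for n_rows in (100, 200, 400):  # one-half, equal to, and twice the source size
    exact = expected_unique_fraction(200, n_rows)
    approx = simulated_unique_fraction(200, n_rows, 500, rng)
    print(f"nRows = {n_rows}: expected = {exact:.3f}, simulated = {approx:.3f}")
```

With 200 source rows and nRows = 200, the expected fraction is about 0.63, so each tree sees roughly two-thirds of the distinct source rows. That is part of why the trees differ from one another.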

Random forest and bagging regression systems are closely related to techniques called adaptive boosting regression and gradient boosting regression. Examples include AdaBoost (adaptive boosting) regression, gradient boosting machine (GBM) regression, LightGBM regression, and XGBoost (extreme gradient boosting) regression.



In machine learning, “bagging” is a shortened form of “bootstrap aggregation”. I’ve never liked either term — “bagging” is too informal and “bootstrap aggregation” doesn’t describe the technique at all.

Left: In American sports, fans who are not pleased with their team will sometimes wear paper bags over their heads to suggest they’re too embarrassed to be seen. There’s a lot of interesting psychology about sports fans, related to a sense of community and belonging. Strange.

Center: For many decades, Blacks in the U.S. used the “paper bag test” where a person’s skin color was physically compared with the color of a brown paper shopping bag — a bag was literally placed next to a person’s face. Only those whose skin tone was lighter or equal to the color of a paper bag were admitted to Black sororities, social groups, choirs, clubs, and so on. According to two of my work colleagues, the paper bag test is still used today, but implicitly, and is called colorism. Very strange.

Right: I’ve noticed that high end luxury items, like expensive designer bags and expensive designer jewelry, often get very fancy bags. Extending this idea to an extreme is this fashion model wearing only a fancy bag. Very, very strange.

