Four of the reasons I rarely use decision trees for regression are characteristics of trees in general, and they apply to both regression and classification:
1. Decision trees are highly unstable — a small change in the training data creates a completely different tree (which effectively eliminates their interpretability).
2. Decision trees are highly susceptible to overfitting.
3. Techniques that deal with overfitting — bootstrap aggregation (“bagging”), random forest, adaptive boosting (“AdaBoost”), and gradient boosting — have the feeling of being hacks rather than mathematically principled (a personal opinion).
4. Decision trees have too many parameters to deal with in practice (an opinion).
A fifth reason I rarely use decision trees for regression is specific to regression problems:
5. A prediction is made using the average of the target values in the associated leaf node which just doesn’t seem right. (Suppose four leaf values are 0.9, 0.1, 0.1, 0.1; the prediction is 0.30).
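To make point 5 concrete: a regression tree leaf predicts the mean of the training targets that landed in that leaf. A tiny sketch using the made-up leaf values above:

```python
import numpy as np

# Hypothetical leaf containing four training target values.
# A regression tree's prediction for any item that reaches this
# leaf is simply the mean of these values.
leaf_values = np.array([0.9, 0.1, 0.1, 0.1])
pred = leaf_values.mean()
print("%0.2f" % pred)  # 0.30
```

Even though one target value (0.9) is very different from the others, the prediction is pulled toward the middle.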
However, the one reason I sometimes use decision trees is:
1. They often work very well in practice.
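For what it's worth, the ensemble techniques mentioned in point 3 are drop-in replacements in scikit. A minimal sketch on made-up data (not the people dataset below), showing that a random forest exposes the same fit/predict interface as a single tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 200 items, 3 predictors, noisy linear target.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200)

# A single tree vs. an ensemble of 100 bagged trees -- same API.
tree = DecisionTreeRegressor(random_state=1).fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

print(tree.predict(X[:1]))    # one-item prediction from the tree
print(forest.predict(X[:1]))  # one-item prediction from the forest
```

The forest averages many trees trained on bootstrap samples, which directly targets the instability problem in point 1 (at the cost of whatever interpretability a single tree had).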
Just for fun, I put together a demo using one of my standard regression examples. The goal is to predict a person’s income from sex, age, State (Michigan, Nebraska, Oklahoma), and political leaning (conservative, moderate, liberal). The data is synthetic and looks like:
 1   0.24   1 0 0   0.2950   0 0 1
-1   0.39   0 0 1   0.5120   0 1 0
 1   0.63   0 1 0   0.7580   1 0 0
-1   0.36   1 0 0   0.4450   0 1 0
 1   0.27   0 1 0   0.2860   0 0 1
. . .
The tab-delimited fields are sex (male = -1, female = +1), age (divided by 100), State (Michigan = 100, Nebraska = 010, Oklahoma = 001), income (divided by 100,000), politics (conservative = 100, moderate = 010, liberal = 001). There are 200 training items and 40 test items.
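A hypothetical encode() helper (my naming, not part of the demo) shows how a raw record maps to the normalized fields described above:

```python
# One-hot lookups for the two categorical predictors.
STATES = {"Michigan": [1, 0, 0], "Nebraska": [0, 1, 0], "Oklahoma": [0, 0, 1]}
POLITICS = {"conservative": [1, 0, 0], "moderate": [0, 1, 0], "liberal": [0, 0, 1]}

def encode(sex, age, state, income, politics):
    # sex: male = -1, female = +1; age / 100; income / 100,000
    s = -1 if sex == "male" else 1
    return [s, age / 100] + STATES[state] + [income / 100_000] + POLITICS[politics]

# First record from the data listing above:
print(encode("female", 24, "Michigan", 29_500, "liberal"))
# [1, 0.24, 1, 0, 0, 0.295, 0, 0, 1]
```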
I used the scikit library with default parameters:
# DecisionTreeRegressor(*, criterion='squared_error',
#  splitter='best', max_depth=None, min_samples_split=2,
#  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
#  max_features=None, random_state=None,
#  max_leaf_nodes=None, min_impurity_decrease=0.0,
#  ccp_alpha=0.0)

model = DecisionTreeRegressor(max_depth=None, random_state=1)
model.fit(train_X, train_y)
To be honest, I was somewhat annoyed when the regression model worked quite well. It scored 98.00% accuracy on the training data and 85.00% accuracy on the test data. I defined an accurate income prediction as one that is within 10% of the true income.
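The within-10% criterion is simple; here is a sketch using the normalized income scale (close_enough is a hypothetical helper name, not part of the demo):

```python
def close_enough(pred, actual, pct=0.10):
    # A prediction counts as correct if it is within pct of the true value.
    return abs(pred - actual) < pct * abs(actual)

print(close_enough(0.50, 0.52))  # True:  0.02 is less than 0.10 * 0.52 = 0.052
print(close_enough(0.50, 0.60))  # False: 0.10 is more than 0.10 * 0.60 = 0.060
```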

I loved Gold Key comic books when I was a young man. Gold Key was created in 1962 as a spin-off of Dell comics. Here are three Gold Key covers that feature man-eating trees. Left: “Mighty Samson” (1962-1969, 20 issues) was set in a post-apocalyptic future. Center: “Space Family Robinson” (1962-1969, 36 issues) was essentially a comic book version of the “Lost in Space” TV show. Right: “Korak Son of Tarzan” (1964-1972, 45 issues) was one of Gold Key’s most popular titles.
Although “Samson”, “Robinson”, and “Korak” were all pretty good, my favorite series were “Turok” (Dell and Gold Key), “Ghost Stories” (Dell), “The Twilight Zone” (Gold Key), and “Boris Karloff Tales of Mystery” (Gold Key).
Demo code below. The data can be found at https://jamesmccaffreyblog.com/2022/10/10/regression-people-income-using-pytorch-1-12-on-windows-10-11/.
# people_income_tree.py
# Python 3.7.6 Windows 10/11
# scikit / sklearn 1.0.2
# predict income from sex, age, State, politics
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import pickle
# sex age state income politics
# -1 0.27 0 1 0 0.7610 0 0 1
# +1 0.19 0 0 1 0.6550 1 0 0
# -----------------------------------------------------------
def accuracy(model, data_X, data_y, pct_close):
  # correct within pct of true income
  n_correct = 0; n_wrong = 0
  for i in range(len(data_X)):
    X = data_X[i].reshape(1, -1)  # one-item batch
    y = data_y[i]
    pred = model.predict(X)  # predicted income
    if np.abs(pred - y) < np.abs(pct_close * y):
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc
# -----------------------------------------------------------
def main():
  print("\nRegression using scikit decision tree demo ")
  print("Predict income from sex, age, State, political ")

  # 0. prepare
  np.random.seed(1)

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, delimiter="\t",
    usecols=[0,1,2,3,4,5,6,7,8], comments="#",
    dtype=np.float32)
  train_X = train_xy[:,[0,1,2,3,4,6,7,8]]
  train_y = train_xy[:,5].flatten()  # 1D required

  print("\nFirst four x = ")
  print(train_X[0:4,:])
  print(" . . . ")
  print("\nFirst four y = ")
  print(train_y[0:4])
  print(" . . . ")

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, delimiter="\t",
    usecols=[0,1,2,3,4,5,6,7,8], comments="#",
    dtype=np.float32)
  test_X = test_xy[:,[0,1,2,3,4,6,7,8]]
  test_y = test_xy[:,5].flatten()  # 1D required

# -----------------------------------------------------------

  # 2. create and train decision tree model
  print("\nCreating and training decision tree regressor ")
  # DecisionTreeRegressor(*, criterion='squared_error',
  #  splitter='best', max_depth=None, min_samples_split=2,
  #  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
  #  max_features=None, random_state=None,
  #  max_leaf_nodes=None, min_impurity_decrease=0.0,
  #  ccp_alpha=0.0)
  model = DecisionTreeRegressor(max_depth=None, random_state=1)
  model.fit(train_X, train_y)

  # 3. compute model accuracy
  print("\nComputing accuracy (within 0.10) ")
  acc_train = accuracy(model, train_X, train_y, 0.10)
  print("Accuracy on train data = %0.4f " % acc_train)
  acc_test = accuracy(model, test_X, test_y, 0.10)
  print("Accuracy on test data = %0.4f " % acc_test)

  # 4. make a prediction
  print("\nPredicting income for M 34 Oklahoma moderate: ")
  X = np.array([[-1, 0.34, 0,0,1, 0,1,0]],
    dtype=np.float32)
  pred_inc = model.predict(X)
  print("$%0.2f" % (pred_inc[0] * 100_000))  # un-normalized

  # 5. save model
  print("\nSaving model ")
  fn = ".\\Models\\tree_model.pkl"
  with open(fn, 'wb') as f:
    pickle.dump(model, f)

  # load model
  # with open(fn, 'rb') as f:
  #   loaded_model = pickle.load(f)
  # pi = loaded_model.predict(X)
  # print("$%0.2f" % (pi[0] * 100_000))  # un-normalized

  print("\nEnd scikit tree regression demo ")

if __name__ == "__main__":
  main()
