Five Reasons Why I Rarely Use Decision Trees for Regression and One Reason Why I Do

The first four reasons I rarely use decision trees for regression are characteristics of trees in general, and they apply to both regression and classification:

1. Decision trees are highly unstable — a small change in the training data creates a completely different tree (which effectively eliminates their interpretability).

2. Decision trees are highly susceptible to overfitting.

3. Techniques that deal with overfitting — bootstrap aggregation (“bagging”), random forest, adaptive boosting (“AdaBoost”), and gradient boosting — have the feeling of being hacks rather than mathematically principled (a personal opinion).

4. Decision trees have too many parameters to deal with in practice (an opinion).

A fifth reason I rarely use decision trees for regression is specific to regression problems:

5. A prediction is made using the average of the target values in the associated leaf node, which just doesn’t seem right. (Suppose the four target values in a leaf are 0.9, 0.1, 0.1, 0.1; the prediction is their average, 0.30.)
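To see the leaf-averaging behavior concretely, here is a minimal sketch using my own toy data (not the demo data below). With only four training items and min_samples_leaf=4, no split is possible, so the tree is a single leaf and every prediction is the mean of the four target values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# four training items that all land in the same (root) leaf
X = np.array([[0.0], [0.1], [0.2], [0.3]])
y = np.array([0.9, 0.1, 0.1, 0.1])

# min_samples_leaf=4 makes any split invalid, so the tree is one leaf
model = DecisionTreeRegressor(min_samples_leaf=4)
model.fit(X, y)

pred = model.predict(np.array([[0.15]]))[0]
print(pred)  # 0.30 -- the average of the leaf's target values
```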

However, the one reason why I sometimes use decision trees is:

1. They often work very well in practice.



Just for fun, I put together a demo using one of my standard regression examples. The goal is to predict a person’s income from sex, age, State (Michigan, Nebraska, Oklahoma), and political leaning (conservative, moderate, liberal). The data is synthetic and looks like:

 1   0.24   1  0  0   0.2950   0  0  1
-1   0.39   0  0  1   0.5120   0  1  0
 1   0.63   0  1  0   0.7580   1  0  0
-1   0.36   1  0  0   0.4450   0  1  0
 1   0.27   0  1  0   0.2860   0  0  1
. . .

The tab-delimited fields are sex (male = -1, female = +1), age (divided by 100), State (Michigan = 1 0 0, Nebraska = 0 1 0, Oklahoma = 0 0 1), income (divided by 100,000), politics (conservative = 1 0 0, moderate = 0 1 0, liberal = 0 0 1). There are 200 training items and 40 test items.
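For clarity, the encoding of one raw record can be sketched like so. The encode() helper and its string arguments are my own illustration, not part of the demo program; income is excluded because it is the value being predicted:

```python
# hypothetical helper illustrating the encoding scheme
def encode(sex, age, state, politics):
  state_map = {"Michigan": [1, 0, 0], "Nebraska": [0, 1, 0],
    "Oklahoma": [0, 0, 1]}
  pol_map = {"conservative": [1, 0, 0], "moderate": [0, 1, 0],
    "liberal": [0, 0, 1]}
  s = -1 if sex == "M" else 1  # male = -1, female = +1
  return [s, age / 100.0] + state_map[state] + pol_map[politics]

print(encode("F", 24, "Michigan", "liberal"))
# [1, 0.24, 1, 0, 0, 0, 0, 1]
```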

I used the scikit library with default parameters:

  # DecisionTreeRegressor(*, criterion='squared_error',
  #  splitter='best', max_depth=None, min_samples_split=2,
  #  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
  #  max_features=None, random_state=None,
  #  max_leaf_nodes=None, min_impurity_decrease=0.0,
  #  ccp_alpha=0.0)

  model = DecisionTreeRegressor(max_depth=None, random_state=1)
  model.fit(train_X, train_y)

To be honest, I was somewhat annoyed when the regression model worked quite well. It scored 98.00% accuracy on the training data and 85.00% accuracy on the test data. I defined an accurate income prediction as one that is within 10% of the true income.
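The demo program below computes accuracy with an explicit loop over the data items. An equivalent vectorized version (my own alternative sketch, not what the demo uses) is:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def accuracy_vec(model, data_X, data_y, pct_close):
  # fraction of predictions within pct_close of the true target
  preds = model.predict(data_X)
  return np.mean(np.abs(preds - data_y) < np.abs(pct_close * data_y))

# tiny smoke test: a full-depth tree memorizes its training data,
# so training accuracy is 1.0
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
model = DecisionTreeRegressor(random_state=1).fit(X, y)
print(accuracy_vec(model, X, y, 0.10))  # 1.0
```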



I loved Gold Key comic books when I was a young man. Gold Key was created in 1962 as a spin-off of Dell comics. Here are three Gold Key covers that feature man-eating trees. Left: “Mighty Samson” (1962-1969, 20 issues) was set in a post-apocalyptic future. Center: “Space Family Robinson” (1962-1969, 36 issues) was essentially a comic book version of the “Lost in Space” TV show. Right: “Korak Son of Tarzan” (1964-1972, 45 issues) was one of Gold Key’s most popular titles.

Although “Samson”, “Robinson”, and “Korak” were all pretty good, my favorite series were “Turok” (Dell and Gold Key), “Ghost Stories” (Dell), “The Twilight Zone” (Gold Key), and “Boris Karloff Tales of Mystery” (Gold Key).


Demo code below. The data can be found at https://jamesmccaffreyblog.com/2022/10/10/regression-people-income-using-pytorch-1-12-on-windows-10-11/.

# people_income_tree.py
# Python 3.7.6  Windows 10/11 
# scikit / sklearn 1.0.2

# predict income from sex, age, State, politics

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import pickle

# sex age   state   income   politics
# -1  0.27  0 1 0   0.7610   0 0 1
# +1  0.19  0 0 1   0.6550   1 0 0

# -----------------------------------------------------------

def accuracy(model, data_X, data_y, pct_close):
  # correct within pct of true income
  n_correct = 0; n_wrong = 0

  for i in range(len(data_X)):
    X = data_X[i].reshape(1, -1)  # one-item batch
    y = data_y[i]
    pred = model.predict(X)       # predicted income

    if np.abs(pred[0] - y) < np.abs(pct_close * y):
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def main():
  print("\nRegression using scikit decision tree demo ")
  print("Predict income from sex, age, State, political ")

  # 0. prepare
  np.random.seed(1)

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, delimiter="\t", 
    usecols=[0,1,2,3,4,5,6,7,8], comments="#", 
    dtype=np.float32)
  train_X = train_xy[:,[0,1,2,3,4,6,7,8]]
  train_y = train_xy[:,5].flatten()  # 1D required

  print("\nFirst four x = ")
  print(train_X[0:4,:])
  print(" . . . ")
  print("\nFirst four y = ")
  print(train_y[0:4])
  print(" . . . ")

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, delimiter="\t", 
    usecols=[0,1,2,3,4,5,6,7,8], comments="#", 
    dtype=np.float32)
  test_X = test_xy[:,[0,1,2,3,4,6,7,8]]
  test_y = test_xy[:,5].flatten()  # 1D required

# -----------------------------------------------------------

  # 2. create and train decision tree model
  print("\nCreating and training decision tree regressor ")

  # DecisionTreeRegressor(*, criterion='squared_error',
  #  splitter='best', max_depth=None, min_samples_split=2,
  #  min_samples_leaf=1, min_weight_fraction_leaf=0.0,
  #  max_features=None, random_state=None,
  #  max_leaf_nodes=None, min_impurity_decrease=0.0,
  #  ccp_alpha=0.0)

  model = DecisionTreeRegressor(max_depth=None, random_state=1)
  model.fit(train_X, train_y)

  # 3. compute model accuracy
  print("\nComputing accuracy (within 0.10) ")
  acc_train = accuracy(model, train_X, train_y, 0.10)
  print("Accuracy on train data = %0.4f " % acc_train)
  acc_test = accuracy(model, test_X, test_y, 0.10)
  print("Accuracy on test data = %0.4f " % acc_test)

  # 4. make a prediction
  print("\nPredicting income for M 34 Oklahoma moderate: ")
  X = np.array([[-1, 0.34, 0,0,1,  0,1,0]],
    dtype=np.float32)
  pred_inc = model.predict(X)
  print("$%0.2f" % (pred_inc[0] * 100_000))  # un-normalized

  # 5. save model
  print("\nSaving model ")
  fn = ".\\Models\\tree_model.pkl"
  with open(fn,'wb') as f:
    pickle.dump(model, f)

  # load model
  # with open(fn, 'rb') as f:
  #   loaded_model = pickle.load(f)
  # pi = loaded_model.predict(X)
  # print("$%0.2f" % (pi[0] * 100_000))  # un-normalized

  print("\nEnd scikit tree regression demo ")

if __name__ == "__main__":
  main()