Decision Trees with the scikit-learn Library

Over the past three years or so, neural networks have come to dominate many areas of machine learning. But traditional ML techniques are still useful in some scenarios.

A decision tree is just a set of if-then rules for classification. There are many standalone decision tree tools and code libraries. For relatively simple problems, my usual approach is to use the scikit-learn (sklearn) code library. Note that sklearn decision trees do not support categorical predictor variables, except for binary categorical predictors encoded as 0-1. I have seen a lot of incorrect information on this topic; handling categorical predictors in a decision tree is a very tricky subject.
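To illustrate the point about non-binary categorical predictors: the usual workaround is to one-hot encode them before training, for example with sklearn's OneHotEncoder. This is just a minimal sketch with a made-up color column, not part of my demo:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# a hypothetical categorical predictor with three possible values
colors = np.array([["red"], ["blue"], ["green"], ["blue"]])

enc = OneHotEncoder()  # categories are discovered and sorted alphabetically
onehot = enc.fit_transform(colors).toarray()  # dense 4x3 result

print(enc.categories_[0])  # ['blue' 'green' 'red']
print(onehot)              # row 0 is "red" -> [0. 0. 1.]
```

Each categorical value becomes its own 0-1 column, which is the binary form that a sklearn decision tree can handle.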

There are many good examples of creating a decision tree using scikit-learn, but I never fully understand a technology until I dive into the code myself. So I created an end-to-end example to refresh my memory.

I created a dummy dataset with 12 items:

28	female	27000	democrat
39	male	97000	independent
38	female	64000	republican
27	male	82000	independent
36	male	48000	democrat
55	female	56000	democrat
44	male	88000	independent
42	male	39000	republican
21	male	43000	republican
49	female	91000	independent
30	female	85000	democrat
56	male	41000	republican

The first three columns are age, sex, and annual income. The goal is to predict political party affiliation, in the fourth column. A sklearn decision tree accepts only numeric predictors (but allows non-numeric labels-to-predict), so I converted the male-female data to 0-1 (male = 0, female = 1):

28	1	27000	democrat
39	0	97000	independent
38	1	64000	republican
27	0	82000	independent
36	0	48000	democrat
55	1	56000	democrat
44	0	88000	independent
42	0	39000	republican
21	0	43000	republican
49	1	91000	independent
30	1	85000	democrat
56	0	41000	republican
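The conversion itself is just a small mapping. A minimal sketch of one way to do it (the variable names are mine, and only the first two data rows are shown):

```python
# map the sex column: male -> 0, female -> 1
sex_map = {"male": 0, "female": 1}

raw = [
    ["28", "female", "27000", "democrat"],
    ["39", "male",   "97000", "independent"],
]

# convert numeric fields to int, encode sex, keep the label as-is
encoded = [[int(r[0]), sex_map[r[1]], int(r[2]), r[3]] for r in raw]
print(encoded)
# [[28, 1, 27000, 'democrat'], [39, 0, 97000, 'independent']]
```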

Next, I wrote a demo program, using parts of several examples I found, plus a bit of new material I figured out myself through experimentation and the sklearn documentation.


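The screenshot of my demo program isn't reproduced here, but a sketch of the key steps, using the encoded 12-item dataset above, would look something like this (the max_depth and seed values are illustrative choices, not the only reasonable ones):

```python
import numpy as np
from sklearn import tree

# predictors: age, sex (male = 0, female = 1), annual income
X = np.array([
    [28, 1, 27000], [39, 0, 97000], [38, 1, 64000],
    [27, 0, 82000], [36, 0, 48000], [55, 1, 56000],
    [44, 0, 88000], [42, 0, 39000], [21, 0, 43000],
    [49, 1, 91000], [30, 1, 85000], [56, 0, 41000]], dtype=np.float64)

# labels-to-predict can be non-numeric strings in sklearn
y = np.array(["democrat", "independent", "republican", "independent",
              "democrat", "democrat", "independent", "republican",
              "republican", "independent", "democrat", "republican"])

clf = tree.DecisionTreeClassifier(max_depth=4, random_state=1)
clf.fit(X, y)

acc = clf.score(X, y)  # classification accuracy on the training data
print("accuracy on training data = %0.4f" % acc)

# predict party for a hypothetical 36-year-old male earning $62,000
probs = clf.predict_proba([[36, 0, 62000]])  # one pseudo-probability per class
pred = clf.predict([[36, 0, 62000]])
print("predicted party = %s" % pred[0])
```

With only 12 training items, the tree will fit the data almost perfectly, which is exactly the kind of scenario where checking training accuracy and then eyeballing the learned rules is feasible.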

I don’t use decision trees very often. But for very small datasets, or when it’s important to be able to interpret how a prediction model makes a specific prediction, decision trees can be very useful.
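One way to get at that interpretability is the sklearn export_text() function, which prints the learned if-then rules in plain text. A minimal sketch with a tiny made-up dataset (not the political-party data):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# tiny made-up dataset: predictors are [age, income]
X = [[25, 40000], [30, 60000], [45, 50000], [50, 90000]]
y = ["no", "no", "yes", "yes"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# print the tree as human-readable if-then rules
rules = export_text(clf, feature_names=["age", "income"])
print(rules)  # lines like "|--- age <= 37.50" showing each split
```

Reading the printed rules top to bottom traces exactly how the model arrives at any specific prediction.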



An Internet image search for just about any word returns some strange results. “Trees”
