Checking Machine Learning Training Data for Multicollinearity Using VIF (Variance Inflation Factor) from Scratch Using Python

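In machine learning, if training data is multicollinear (one predictor column can be closely approximated by a linear combination of other columns), the resulting model will likely be poor. For linear models in particular, the coefficient estimates become unstable and hard to interpret. The most common way to analyze training data for multicollinearity is to compute the VIF (variance inflation factor) for each column of the data.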

VIF is a value between 1.0 and positive infinity (well, in weird scenarios where R2 is negative, such as a regression fit without an intercept, a VIF value could be less than 1.0). As a rough rule of thumb, if all column VIF values are less than about 7.0, the data is probably OK.

If VIF is close to 1.0, the column is not correlated with the other columns.
If VIF is between 1.0 and 5.0, the column is mildly correlated.
If VIF is between 5.0 and 10.0, the column is highly correlated.
If VIF is greater than 10.0, the column is extremely correlated.
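
The rough bands above can be written as a small helper. This is only a sketch: the function name describe_vif and the 1.05 cutoff for "close to 1.0" are assumptions made for illustration, and the bands themselves are rules of thumb rather than hard limits.

```python
def describe_vif(vif, near_one=1.05):
    """Map a VIF value to the rough labels in the list above.

    near_one is an assumed cutoff for "close to 1.0"; the list
    above does not give an exact number for it.
    """
    if vif <= near_one:
        return "not correlated"
    if vif <= 5.0:
        return "mildly correlated"
    if vif <= 10.0:
        return "highly correlated"
    return "extremely correlated"

# Example values only, chosen to hit each band
for v in (1.0, 1.25, 7.0, 100.0):
    print("vif = %8.2f : %s" % (v, describe_vif(v)))
```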

To compute the VIF for a specified column of training data, you treat that column as the dependent y variable and the remaining columns as the independent predictor variables. You then fit a linear regression model and compute its R2 (coefficient of determination). The VIF value for the column is 1.0 / (1.0 - R2).
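
The recipe above can also be sketched with plain NumPy least squares. This is a minimal illustration, not the demo program: the function name vif_lstsq and the random toy data are assumptions, and the fit includes an intercept column of ones.

```python
import numpy as np

def vif_lstsq(data, i):
    """VIF for column i: regress column i on all the other columns."""
    y = data[:, i]                       # column i is the dependent variable
    others = np.delete(data, i, axis=1)  # remaining columns are predictors
    # Prepend a column of ones so the regression has an intercept term
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = np.sum(resid ** 2)           # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot            # coefficient of determination
    return 1.0 / (1.0 - r2)

# Toy data (assumed for illustration): 20 rows, 3 unrelated columns
rng = np.random.default_rng(0)
demo = rng.uniform(-1.0, 1.0, size=(20, 3))
vifs = [vif_lstsq(demo, c) for c in range(demo.shape[1])]
for c, v in enumerate(vifs):
    print("col = %3d | vif = %0.4f" % (c, v))
```

Because the regression includes an intercept, R2 cannot be negative on the training data, so each VIF computed this way is at least 1.0.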

Suppose that you have a set of training data X predictor values, and you use some column c as the dependent y variable and all the other columns as predictors for c. After training the linear regression model, you compute R2 and it is 0.90, which means column c is predicted very well by the other columns. The VIF value for column c is 1.0 / (1.0 - R2) = 1.0 / 0.10 = 10.0, which is large. That is bad because column c is close to a linear combination of the other columns, so the data is somewhat multicollinear. Now, with the same setup, suppose R2 is 0.20, which means column c cannot be predicted well by the other columns. The VIF value is 1.0 / (1.0 - 0.20) = 1.0 / 0.80 = 1.25, which is a small value. That is good because column c is not close to a linear combination of the other columns, and so column c does not contribute to multicollinearity.
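
The arithmetic in the two scenarios above can be checked with a few lines of Python. The helper names vif_from_r2 and r2_from_vif are made up for this sketch. The second function inverts the formula, which helps when reading a VIF value back as an R2.

```python
def vif_from_r2(r2):
    # VIF = 1.0 / (1.0 - R2); undefined when R2 is exactly 1.0
    return 1.0 / (1.0 - r2)

def r2_from_vif(vif):
    # Inverse of the formula above: R2 = 1.0 - 1.0 / VIF
    return 1.0 - 1.0 / vif

for r2 in (0.20, 0.90):
    print("R2 = %0.2f  VIF = %0.2f" % (r2, vif_from_r2(r2)))
# R2 = 0.20  VIF = 1.25
# R2 = 0.90  VIF = 10.00
```

Going the other way, a VIF in the millions corresponds to an R2 that differs from 1.0 by less than one part in a million.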

I put together a demo using Python and the scikit-learn library. I created two datasets. The first dataset has five columns of predictors, followed by a column of target y values. The data is “normal” in the sense that there’s no multicollinearity. There are 20 items. It looks like:

-0.1660,  0.4406, -0.9998, -0.3953, -0.7065,  0.4840
 0.0776, -0.1616,  0.3704, -0.5911,  0.7562,  0.1568
-0.9452,  0.3409, -0.1654,  0.1174, -0.7192,  0.8054
. . .

The second dataset is highly multicollinear, where the third column is 2 times the first column, plus the second column, plus a small random value between 0.000 and 0.001. It looks like:

-0.1660,  0.4406,  0.1096, -0.3953, -0.7065, 0.4840
 0.0776, -0.1616, -0.0045, -0.5911,  0.7562, 0.1568
-0.9452,  0.3409, -1.5482,  0.1174, -0.7192, 0.8054
. . .
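
A collinear column like this could be generated along the following lines. This is a hedged sketch, not the code used to make the demo file: the seed, the uniform range for the base predictors, and the variable names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, assumed for repeatability

# Assumed base data: 20 rows, 5 predictor columns in [-1, +1)
X = rng.uniform(-1.0, 1.0, size=(20, 5))

# Small random value between 0.000 and 0.001 for each row
noise = rng.uniform(0.0, 0.001, size=20)

# Overwrite column [2] so it is almost exactly 2 * col[0] + col[1]
X_collinear = X.copy()
X_collinear[:, 2] = 2.0 * X[:, 0] + X[:, 1] + noise

print(X_collinear[:3])
```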

The output of the demo program is:

Begin variance inflation factor demo

Loading synthetic (20) normal data

First two X:
[-0.1660  0.4406 -0.9998 -0.3953 -0.7065]
[ 0.0776 -0.1616  0.3704 -0.5911  0.7562]

Begin VIF analysis
col =   0 | vif = 1.1980
col =   1 | vif = 1.4591
col =   2 | vif = 1.2345
col =   3 | vif = 1.3025
col =   4 | vif = 1.2120

Loading synthetic (20) multicollinear data
(col[2] = 2.0 * col[0] + col[1] + rnd)

First two X:
[-0.1660  0.4406  0.1096 -0.3953 -0.7065]
[ 0.0776 -0.1616 -0.0045 -0.5911  0.7562]

Begin VIF analysis
col =   0 vif = 25546262.9389
col =   1 vif = 6023299.3886
col =   2 vif = 30951889.1370
col =   3 vif = 1.2937
col =   4 vif = 1.2117

End demo

As expected, the first dataset didn’t have any bad VIF values, but the VIF values for the second dataset show that columns [0], [1], [2] are highly correlated.
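
To make the comparison concrete, here is a small sketch that applies the "less than about 7.0" rule of thumb to the VIF values printed in the demo output. The variable names and the exact 7.0 cutoff are illustrative choices.

```python
# VIF values copied from the demo output above
vifs_normal = [1.1980, 1.4591, 1.2345, 1.3025, 1.2120]
vifs_collinear = [25546262.9389, 6023299.3886, 30951889.1370,
                  1.2937, 1.2117]

threshold = 7.0  # rough rule of thumb, not a hard limit

def flag_columns(vifs, threshold):
    # Indices of the columns whose VIF exceeds the threshold
    return [c for c, v in enumerate(vifs) if v > threshold]

print(flag_columns(vifs_normal, threshold))     # []
print(flag_columns(vifs_collinear, threshold))  # [0, 1, 2]
```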

No moral to this blog post. Just an interesting exploration.



In machine learning, you don’t want a relationship between two columns in your training data. But in science fiction movies, you absolutely want a good relationship between the hero and the main actress.

I’m a huge fan of science fiction movies from the 1950s and 1960s. Here are posters of two films that were good, but they could have been great if the chemistry between the hero and the main lady were better.

Left: In “Crack in the World” (1965), scientists create a project to drill to the Earth’s magma center to gain a source of unlimited heat, and therefore unlimited energy. The plan involves firing a thermonuclear missile into a hole. This was not a good idea, to put it mildly. The chemistry between Dr. Rampion (actor Kieron Moore) and the wife of his boss, Dr. Sorensen (actress Janette Scott), was, well, nonexistent. But I give the movie a B grade anyway.

Right: In “The Day the Earth Caught Fire” (1961), the U.S. and the Soviets unknowingly explode nuclear test weapons at the same time on the same day. This was not a good idea, to put it mildly. The Earth is knocked out of orbit, towards the Sun. Only exploding every nuclear device on the planet simultaneously might save humanity. The chemistry between newspaper reporter Peter Stenning (actor Edward Judd) and office worker Jeannie Craig (actress Janet Munro, one of my sci-fi favorites) was awkward and unconvincing. But I give the movie a B- grade anyway.


Demo program:

# variance_inflation_factor.py

import numpy as np
from sklearn.linear_model import LinearRegression

np.set_printoptions(precision=4, suppress=True,
    floatmode='fixed')

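def vif(data, i):
  # VIF for col [i]: regress col [i] on all the other columns,
  # then vif = 1.0 / (1.0 - R2)
  X = np.delete(data, i, axis=1)  # all cols except i
  y = data[:,i]                   # col i is the dependent variable

  model = LinearRegression()      # fits an intercept by default
  model.fit(X, y)
  r2 = model.score(X, y)          # R2 on the training data
  if r2 >= 1.0:                   # perfect fit: avoid divide by zero
    return float("inf")
  return 1.0 / (1.0 - r2)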

# -----------------------------------------------------------

print("\nBegin variance inflation factor demo ")

print("\nLoading synthetic (20) normal data ")
train_X = \
  np.loadtxt(".\\Data\\synthetic_train_20.txt",
  usecols=[0,1,2,3,4], delimiter=",")
 
print("\nFirst two X: ")
for i in range(2):
  print(train_X[i])

print("\nBegin VIF analysis ")

for c in range(len(train_X[0])):
  z = vif(train_X, c)
  print("col = %3d | vif = %0.4f " % (c, z))

print("\nLoading synthetic (20) multicollinear data ")
print("(col[2] = 2.0 * col[0] + col[1] + rnd) ")
train_X = \
  np.loadtxt(".\\Data\\synthetic_train_20_collinear.txt",
  usecols=[0,1,2,3,4], delimiter=",")

print("\nFirst two X: ")
for i in range(2):
  print(train_X[i])

print("\nBegin VIF analysis ")

for c in range(len(train_X[0])):
  z = vif(train_X, c)
  print("col = %3d vif = %0.4f " % (c, z))
print("\nEnd demo ")

First, normal, dataset:

# synthetic_train_20.txt
#
-0.1660,  0.4406, -0.9998, -0.3953, -0.7065,  0.4840
 0.0776, -0.1616,  0.3704, -0.5911,  0.7562,  0.1568
-0.9452,  0.3409, -0.1654,  0.1174, -0.7192,  0.8054
 0.9365, -0.3732,  0.3846,  0.7528,  0.7892,  0.1345
-0.8299, -0.9219, -0.6603,  0.7563, -0.8033,  0.7955
 0.0663,  0.3838, -0.3690,  0.3730,  0.6693,  0.3206
-0.9634,  0.5003,  0.9777,  0.4963, -0.4391,  0.7377
-0.1042,  0.8172, -0.4128, -0.4244, -0.7399,  0.4801
-0.9613,  0.3577, -0.5767, -0.4689, -0.0169,  0.6861
-0.7065,  0.1786,  0.3995, -0.7953, -0.1719,  0.5569
 0.3888, -0.1716, -0.9001,  0.0718,  0.3276,  0.2500
 0.1731,  0.8068, -0.7251, -0.7214,  0.6148,  0.3297
-0.2046, -0.6693,  0.8550, -0.3045,  0.5016,  0.2129
 0.2473,  0.5019, -0.3022, -0.4601,  0.7918,  0.2613
-0.1438,  0.9297,  0.3269,  0.2434, -0.7705,  0.5171
 0.1568, -0.1837, -0.5259,  0.8068,  0.1474,  0.3307
-0.9943,  0.2343, -0.3467,  0.0541,  0.7719,  0.5581
 0.2467, -0.9684,  0.8589,  0.3818,  0.9946,  0.1092
-0.6553, -0.7257,  0.8652,  0.3936, -0.8680,  0.7018
 0.8460,  0.4230, -0.7515, -0.9602, -0.9476,  0.1996

Second, multicollinear, dataset:

# synthetic_train_20_collinear.txt
# col [2] = 2*[0] + [1] + rand(0.001)
#
-0.1660,  0.4406,  0.1096, -0.3953, -0.7065, 0.4840
 0.0776, -0.1616, -0.0045, -0.5911,  0.7562, 0.1568
-0.9452,  0.3409, -1.5482,  0.1174, -0.7192, 0.8054
 0.9365, -0.3732,  1.5016,  0.7528,  0.7892, 0.1345
-0.8299, -0.9219, -2.5800,  0.7563, -0.8033, 0.7955
 0.0663,  0.3838,  0.5179,  0.3730,  0.6693, 0.3206
-0.9634,  0.5003, -1.4245,  0.4963, -0.4391, 0.7377
-0.1042,  0.8172,  0.6100, -0.4244, -0.7399, 0.4801
-0.9613,  0.3577, -1.5636, -0.4689, -0.0169, 0.6861
-0.7065,  0.1786, -1.2325, -0.7953, -0.1719, 0.5569
 0.3888, -0.1716,  0.6073,  0.0718,  0.3276, 0.2500
 0.1731,  0.8068,  1.1544, -0.7214,  0.6148, 0.3297
-0.2046, -0.6693, -1.0770, -0.3045,  0.5016, 0.2129
 0.2473,  0.5019,  0.9980, -0.4601,  0.7918, 0.2613
-0.1438,  0.9297,  0.6435,  0.2434, -0.7705, 0.5171
 0.1568, -0.1837,  0.1313,  0.8068,  0.1474, 0.3307
-0.9943,  0.2343, -1.7528,  0.0541,  0.7719, 0.5581
 0.2467, -0.9684, -0.4732,  0.3818,  0.9946, 0.1092
-0.6553, -0.7257, -2.0345,  0.3936, -0.8680, 0.7018
 0.8460,  0.4230,  2.1166, -0.9602, -0.9476, 0.1996