In machine learning, if training data is multicollinear, the resulting model will likely be poor. The most common way to analyze training data for multicollinearity is to compute the VIF (variance inflation factor) for each column of the data.
VIF is a value between 1.0 and positive infinity (in unusual scenarios, a VIF value can be less than 1.0). Briefly, if all column VIF values are less than about 7.0, the data is probably OK. The usual rules of thumb are:
If VIF is close to 1.0, the column is not correlated with the other columns.
If VIF is between 1.0 and 5.0, the column is mildly correlated.
If VIF is between 5.0 and 10.0, the column is highly correlated.
If VIF is greater than 10.0, the column is extremely correlated.
To compute the VIF for a specified column of training data, you use that column as the dependent y variable and the remaining columns as the independent predictor variables, fit a linear regression model, and then compute the R2 (coefficient of determination) for the model. The VIF value for the column is 1.0 / (1.0 - R2).
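The procedure just described can be sketched with NumPy alone. This is a minimal illustration, not the demo program: it swaps scikit-learn's LinearRegression for an ordinary least-squares fit via np.linalg.lstsq with an explicit intercept column. The function name vif_lstsq and the randomly generated data are assumptions made for the example.

```python
import numpy as np

def vif_lstsq(data, i):
    # Column i is the dependent variable y; all other columns are predictors.
    y = data[:, i]
    X = np.delete(data, i, axis=1)
    # Prepend a column of ones so the least-squares fit has an intercept.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = np.sum(resid ** 2)            # variation not explained by the fit
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r2)

# Hypothetical data: three independent columns plus a fourth column that is
# nearly a linear combination of the first two.
rng = np.random.default_rng(0)
base = rng.uniform(-1.0, 1.0, size=(200, 3))
extra = 2.0 * base[:, 0] + base[:, 1] + rng.normal(0.0, 0.01, size=200)
data = np.column_stack([base, extra])

vifs = [vif_lstsq(data, c) for c in range(data.shape[1])]
for c, v in enumerate(vifs):
    print("col = %3d | vif = %0.4f" % (c, v))
```

In this made-up data, columns 0, 1, and 3 are tied together by construction, so their VIF values should come out very large, while column 2 should stay near 1.0. Note that the function divides by zero if a column is an exact linear combination of the others.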
Suppose that you have a set of training data X predictor values, and you use some column c as the dependent y variable and all the other columns as predictors for c. After training the linear regression model, you compute R2 and it is 0.90, which means column c is predicted very well by the other columns. The VIF value for column c is 1.0 / (1.0 - R2) = 1.0 / 0.10 = 10.0, which is large. That is bad because column c is close to a linear combination of the other columns, so the data is somewhat multicollinear. Now, with the same setup, suppose R2 is 0.20, which means column c cannot be predicted well by the other columns. The VIF value is 1.0 / (1.0 - 0.20) = 1.0 / 0.80 = 1.25, which is a small value. That is good because column c is not close to a linear combination of the other columns, and therefore the data is not multicollinear.
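The arithmetic in the two scenarios above is easy to tabulate. The sketch below converts R2 to VIF and back. The helper names vif_from_r2 and r2_from_vif are made up for illustration.

```python
def vif_from_r2(r2):
    # VIF = 1 / (1 - R2); undefined when R2 is exactly 1.0
    return 1.0 / (1.0 - r2)

def r2_from_vif(vif):
    # Inverse relationship: solve VIF = 1 / (1 - R2) for R2
    return 1.0 - 1.0 / vif

for r2 in (0.20, 0.50, 0.80, 0.90, 0.99):
    print("R2 = %0.2f  ->  VIF = %0.2f" % (r2, vif_from_r2(r2)))
```

Going the other way, a VIF cutoff of 5.0 corresponds to an R2 of 0.80, and a cutoff of 10.0 corresponds to an R2 of 0.90.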
I put together a demo using Python and the scikit-learn library. I created two datasets. The first dataset has five columns of predictors, followed by a column of target y values. The data is “normal” in the sense that there’s no multicollinearity. There are 20 items. It looks like:
-0.1660, 0.4406, -0.9998, -0.3953, -0.7065, 0.4840
0.0776, -0.1616, 0.3704, -0.5911, 0.7562, 0.1568
-0.9452, 0.3409, -0.1654, 0.1174, -0.7192, 0.8054
. . .
The second dataset is highly multicollinear, where the third column is 2 times the first column, plus the second column, plus a small random value between 0.000 and 0.001. It looks like:
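One way such a collinear column could be produced is sketched below. This is not the script that generated the demo files. The uniform(-1, 1) predictor values, the seed, and the variable names are all assumptions for illustration. Only the rule for the third column comes from the description above.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for repeatability only

# Hypothetical stand-in for 20 rows of five predictor columns.
X = rng.uniform(-1.0, 1.0, size=(20, 5))

# Small random value between 0.000 and 0.001 for each row.
noise = rng.uniform(0.0, 0.001, size=20)

# Overwrite the third column: 2 * first column + second column + noise.
X_collinear = X.copy()
X_collinear[:, 2] = 2.0 * X[:, 0] + X[:, 1] + noise

print(X_collinear[:3])
```

Because the noise is tiny compared with the spread of the columns, the third column is almost exactly determined by the first two, which is what drives the VIF values up.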
-0.1660, 0.4406, 0.1096, -0.3953, -0.7065, 0.4840
0.0776, -0.1616, -0.0045, -0.5911, 0.7562, 0.1568
-0.9452, 0.3409, -1.5482, 0.1174, -0.7192, 0.8054
. . .
The output of the demo program is:
Begin variance inflation factor demo

Loading synthetic (20) normal data

First two X:
[-0.1660 0.4406 -0.9998 -0.3953 -0.7065]
[ 0.0776 -0.1616 0.3704 -0.5911 0.7562]

Begin VIF analysis
col = 0 | vif = 1.1980
col = 1 | vif = 1.4591
col = 2 | vif = 1.2345
col = 3 | vif = 1.3025
col = 4 | vif = 1.2120

Loading synthetic (20) multicollinear data
(col[2] = 2.0 * col[0] + col[1] + rnd)

First two X:
[-0.1660 0.4406 0.1096 -0.3953 -0.7065]
[ 0.0776 -0.1616 -0.0045 -0.5911 0.7562]

Begin VIF analysis
col = 0 vif = 25546262.9389
col = 1 vif = 6023299.3886
col = 2 vif = 30951889.1370
col = 3 vif = 1.2937
col = 4 vif = 1.2117

End demo
As expected, the first dataset didn’t have any bad VIF values, but the VIF values for the second dataset show that columns [0], [1], and [2] are extremely correlated, while columns [3] and [4] are unaffected.
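A handy cross-check on VIF values: for a regression that includes an intercept, the VIF for each column equals the corresponding diagonal entry of the inverse of the correlation matrix of the predictor columns. The sketch below uses that identity. The function name vif_from_corr and the random data are assumptions for illustration, and this is not how the demo program computes its values.

```python
import numpy as np

def vif_from_corr(data):
    # Correlation matrix of the columns (rowvar=False: columns are variables).
    corr = np.corrcoef(data, rowvar=False)
    # Diagonal of the inverse holds one VIF per column.
    return np.diag(np.linalg.inv(corr))

# Hypothetical data: four columns, with the last one partly driven by the first.
rng = np.random.default_rng(3)
demo = rng.normal(size=(50, 4))
demo[:, 3] += 0.5 * demo[:, 0]

for c, v in enumerate(vif_from_corr(demo)):
    print("col = %3d | vif = %0.4f" % (c, v))
```

This computes every VIF in one step, but when columns are almost perfectly collinear the correlation matrix is close to singular, so the inverse can be numerically unreliable or fail outright. In that situation the column-by-column regression approach is the safer check.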
No moral to this blog post. Just an interesting exploration.

In machine learning, you don’t want a relationship between two columns in your training data. But in science fiction movies, you absolutely want a good relationship between the hero and the main actress.
I’m a huge fan of science fiction movies from the 1950s and 1960s. Here are posters of two films that were good, but they could have been great if the chemistry between the hero and the main lady were better.
Left: In “Crack in the World” (1965), scientists create a project to drill to the Earth’s magma center to gain a source of unlimited heat, and therefore unlimited energy. The plan involves firing a thermonuclear missile into a hole. This was not a good idea, to put it mildly. The chemistry between Dr. Rampion (actor Kieron Moore) and the wife of his boss, Dr. Sorensen (actress Janette Scott) was, well, nonexistent. But I give the movie a B grade anyway.
Right: In “The Day the Earth Caught Fire” (1961), the U.S. and the Soviets unknowingly explode nuclear test weapons at the same time on the same day. This was not a good idea, to put it mildly. The Earth is knocked out of orbit, towards the Sun. Only exploding every nuclear device on the planet simultaneously might save humanity. The chemistry between newspaper reporter Peter Stenning (actor Edward Judd) and office worker Jeannie Craig (actress Janet Munro, one of my sci fi favorites) was awkward and unconvincing. But I give the movie a B- grade anyway.
Demo program:
# variance_inflation_factor.py
import numpy as np
from sklearn.linear_model import LinearRegression
np.set_printoptions(precision=4, suppress=True,
  floatmode='fixed')

def vif(data, i):
  # vif = 1.0 / (1.0 - R2) if col [i] is dependent variable
  X = np.delete(data, i, axis=1)  # all cols except i
  y = data[:,i]
  model = LinearRegression()
  model.fit(X, y)
  r2 = model.score(X, y)
  result = 1.0 / (1.0 - r2)
  return result

# -----------------------------------------------------------
print("\nBegin variance inflation factor demo ")
print("\nLoading synthetic (20) normal data ")
train_X = \
  np.loadtxt(".\\Data\\synthetic_train_20.txt",
  usecols=[0,1,2,3,4], delimiter=",")
print("\nFirst two X: ")
for i in range(2):
  print(train_X[i])

print("\nBegin VIF analysis ")
for c in range(len(train_X[0])):
  z = vif(train_X, c)
  print("col = %3d | vif = %0.4f " % (c, z))

print("\nLoading synthetic (20) multicollinear data ")
print("(col[2] = 2.0 * col[0] + col[1] + rnd) ")
train_X = \
  np.loadtxt(".\\Data\\synthetic_train_20_collinear.txt",
  usecols=[0,1,2,3,4], delimiter=",")
print("\nFirst two X: ")
for i in range(2):
  print(train_X[i])

print("\nBegin VIF analysis ")
for c in range(len(train_X[0])):
  z = vif(train_X, c)
  print("col = %3d vif = %0.4f " % (c, z))

print("\nEnd demo ")
First, normal, dataset:
# synthetic_train_20.txt
#
-0.1660, 0.4406, -0.9998, -0.3953, -0.7065, 0.4840
0.0776, -0.1616, 0.3704, -0.5911, 0.7562, 0.1568
-0.9452, 0.3409, -0.1654, 0.1174, -0.7192, 0.8054
0.9365, -0.3732, 0.3846, 0.7528, 0.7892, 0.1345
-0.8299, -0.9219, -0.6603, 0.7563, -0.8033, 0.7955
0.0663, 0.3838, -0.3690, 0.3730, 0.6693, 0.3206
-0.9634, 0.5003, 0.9777, 0.4963, -0.4391, 0.7377
-0.1042, 0.8172, -0.4128, -0.4244, -0.7399, 0.4801
-0.9613, 0.3577, -0.5767, -0.4689, -0.0169, 0.6861
-0.7065, 0.1786, 0.3995, -0.7953, -0.1719, 0.5569
0.3888, -0.1716, -0.9001, 0.0718, 0.3276, 0.2500
0.1731, 0.8068, -0.7251, -0.7214, 0.6148, 0.3297
-0.2046, -0.6693, 0.8550, -0.3045, 0.5016, 0.2129
0.2473, 0.5019, -0.3022, -0.4601, 0.7918, 0.2613
-0.1438, 0.9297, 0.3269, 0.2434, -0.7705, 0.5171
0.1568, -0.1837, -0.5259, 0.8068, 0.1474, 0.3307
-0.9943, 0.2343, -0.3467, 0.0541, 0.7719, 0.5581
0.2467, -0.9684, 0.8589, 0.3818, 0.9946, 0.1092
-0.6553, -0.7257, 0.8652, 0.3936, -0.8680, 0.7018
0.8460, 0.4230, -0.7515, -0.9602, -0.9476, 0.1996
Second, multicollinear, dataset:
# synthetic_train_20_collinear.txt
# col [2] = 2*[0] + [1] + rand(0.001)
#
-0.1660, 0.4406, 0.1096, -0.3953, -0.7065, 0.4840
0.0776, -0.1616, -0.0045, -0.5911, 0.7562, 0.1568
-0.9452, 0.3409, -1.5482, 0.1174, -0.7192, 0.8054
0.9365, -0.3732, 1.5016, 0.7528, 0.7892, 0.1345
-0.8299, -0.9219, -2.5800, 0.7563, -0.8033, 0.7955
0.0663, 0.3838, 0.5179, 0.3730, 0.6693, 0.3206
-0.9634, 0.5003, -1.4245, 0.4963, -0.4391, 0.7377
-0.1042, 0.8172, 0.6100, -0.4244, -0.7399, 0.4801
-0.9613, 0.3577, -1.5636, -0.4689, -0.0169, 0.6861
-0.7065, 0.1786, -1.2325, -0.7953, -0.1719, 0.5569
0.3888, -0.1716, 0.6073, 0.0718, 0.3276, 0.2500
0.1731, 0.8068, 1.1544, -0.7214, 0.6148, 0.3297
-0.2046, -0.6693, -1.0770, -0.3045, 0.5016, 0.2129
0.2473, 0.5019, 0.9980, -0.4601, 0.7918, 0.2613
-0.1438, 0.9297, 0.6435, 0.2434, -0.7705, 0.5171
0.1568, -0.1837, 0.1313, 0.8068, 0.1474, 0.3307
-0.9943, 0.2343, -1.7528, 0.0541, 0.7719, 0.5581
0.2467, -0.9684, -0.4732, 0.3818, 0.9946, 0.1092
-0.6553, -0.7257, -2.0345, 0.3936, -0.8680, 0.7018
0.8460, 0.4230, 2.1166, -0.9602, -0.9476, 0.1996
