I was chit-chatting with a colleague a couple of days ago and the topic of kernel density estimation (KDE) came up. The first hurdle when trying to understand KDE is figuring out exactly what kind of problem the technique solves. Briefly, suppose you have a set of data points, for example the heights of 21 men, normalized by subtracting the average height and dividing by the standard deviation of the heights. The source data might look like [1.76, 0.40, 0.98, . . . -2.55]. KDE is a classical statistics technique that determines an estimated probability density function (PDF) from which your source data might have come.

The blue-bars histogram is the source data. The red line is the estimated PDF function computed from the source data.
A probability density function is a curve where the total area under the curve is 1.
Just for statistical hoots, I coded up a quick demo using the stats.gaussian_kde() function from the SciPy library. There are many ways to estimate a PDF. The “gaussian” in the name of the SciPy function indicates that many Gaussian kernel functions are used behind the scenes to determine the estimated PDF function.
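To make that idea concrete, here's a minimal sketch (not the library's actual implementation) of what happens behind the scenes: one Gaussian bump, with a bandwidth set by Scott's rule, is centered on each data point, and the estimated PDF is the average of the bumps. The manual_kde() helper name and the six hard-coded data points are my own, just for illustration.

```python
# manual_kde.py
# sketch: a Gaussian KDE is the average of one Gaussian per data point
import numpy as np
from scipy import stats

def manual_kde(x_data, x_pts):
  # Scott's rule bandwidth for 1-D data: h = sigma * n^(-1/5)
  n = len(x_data)
  h = np.std(x_data, ddof=1) * n ** (-1.0 / 5)
  # one Gaussian density centered on each data point, then average
  diffs = x_pts[:, None] - x_data[None, :]  # shape (num_pts, n)
  kernels = np.exp(-0.5 * (diffs / h) ** 2) / (h * np.sqrt(2 * np.pi))
  return kernels.mean(axis=1)

x_data = np.array([1.76, 0.40, 0.98, 2.24, 1.87, -0.98])
x_pts = np.linspace(-4, 4, 41)
print(np.allclose(manual_kde(x_data, x_pts),
  stats.gaussian_kde(x_data).evaluate(x_pts)))
```

With SciPy's default settings this manual version and stats.gaussian_kde() agree to within floating-point tolerance.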
In my demo, I hard-coded 21 data points that were loosely Gaussian distributed, then used the stats.gaussian_kde() function to estimate the distribution from which the 21 data points were drawn. The graph looked pretty good.
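One quick sanity check: because the result is a PDF, the total area under the estimated curve should be (approximately) 1. A gaussian_kde object has an integrate_box_1d() method for exactly this. The short data array below is just a subset of the demo values for illustration.

```python
# verify the KDE result integrates to ~1, as any PDF must
import numpy as np
from scipy import stats

x_data = np.array([1.76, 0.40, 0.98, 2.24, 1.87,
  -0.98, 0.95, -0.15, -0.10, 0.41])
gkde_obj = stats.gaussian_kde(x_data)
area = gkde_obj.integrate_box_1d(-np.inf, np.inf)
print("Area under estimated PDF = %0.4f " % area)
```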
I’ve been studying classical statistics and computational statistics for many years. I’ve never run into a real-life problem where I needed KDE but you never know — it’s important to have as many tools in your technical skillset as possible.

Minimalist portraits are an estimate of reality. Left: By artist Jules Tillman. Center: By artist Malika Favre. Right: By artist Rokas Aleliunas.
# kde_demo.py
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
print("\nBegin kernel density estimation demo ")
x_data = np.array([
1.76, 0.40, 0.98, 2.24, 1.87, -0.98,
0.95, -0.15, -0.10, 0.41, 0.14, 1.45,
0.76, 0.12, 0.44, 0.33, 1.49, -0.21,
0.31, -0.85, -2.55])
print("\nSource data points (normal): ")
print(x_data)
print("\nGenerating estimated PDF function from source x_data ")
gkde_obj = stats.gaussian_kde(x_data)
x_pts = np.linspace(-4, +4, 41)
print("\nFeeding points to KDE estimated PDF: ")
estimated_pdf = gkde_obj.evaluate(x_pts)
# print("\nEstimated y data points from KDE: ")
# print(estimated_pdf)
plt.figure()
plt.hist(x_data, bins=7, density=True)
plt.plot(x_pts, estimated_pdf, color="r",
  label="kde estimated PDF")
plt.legend()
plt.show()
print("\nEnd demo ")
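The smoothness of the estimated PDF is controlled by a bandwidth value. By default, gaussian_kde() picks the bandwidth using Scott's rule, but the function also accepts a bw_method parameter: the string "silverman", or a scalar multiplier if you want to under-smooth or over-smooth. The values 0.2 and 0.8 below are arbitrary choices I made for illustration.

```python
# effect of the bw_method parameter on KDE smoothness
import numpy as np
from scipy import stats

x_data = np.array([1.76, 0.40, 0.98, 2.24, 1.87, -0.98,
  0.95, -0.15, -0.10, 0.41, 0.14, 1.45])
x_pts = np.linspace(-4, 4, 41)
wiggly = stats.gaussian_kde(x_data, bw_method=0.2)  # small h: spiky
smooth = stats.gaussian_kde(x_data, bw_method=0.8)  # large h: smooth
print("peak (bw=0.2) = %0.4f " % max(wiggly.evaluate(x_pts)))
print("peak (bw=0.8) = %0.4f " % max(smooth.evaluate(x_pts)))
```

A smaller bandwidth hugs the data points more tightly, which shows up as a taller, spikier estimated PDF.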