I stumbled across an interesting idea called energy distance. Energy distance is a single number that measures how far apart two probability distributions are. There are many other ways to define the distance between two distributions; the Kullback-Leibler divergence is one example.
Suppose you have two distributions, X and Y, where each item is a vector with 4 values (so dim = 4). You draw n = 3 samples from the first distribution and m = 2 samples from the second distribution. If X and Y are:
X =
[[0.10  0.50  0.30  0.10]
 [0.30  0.60  0.00  0.10]
 [0.00  0.80  0.05  0.15]]

Y =
[[0.10  0.30  0.20  0.40]
 [0.05  0.25  0.40  0.30]]
then the energy distance between X and Y is 0.8134 — maybe (see below).
I was motivated to explore energy distance because my Internet searches turned up literally no worked examples. A lack of information like this always intrigues me.

The inventor of energy distance is a mathematician named G.J. Szekely. The few research papers I found were all written by him, and the Wikipedia article on energy distance looks like it was entirely written by him too — somewhat of a red flag. In any event, the Wikipedia page wasn't much help as an implementation guide.
If you have a sample X and a sample Y, you must compute Euclidean distances (or any other vector distance measure) between all pairs of X items, distances between all pairs of Y items, and distances between all pairs of X and Y items. Then you compute the average distance between all X items, the average distance between all Y items, and the average distance between all X-Y pairs. Then energy distance is:
sqrt[ (2 * avg_xy) - avg_xx - avg_yy ]
At least I think this is how energy distance is computed, based on the research papers I read. Notice that energy distance won't scale well to large datasets because the number of pairwise distance calculations grows quadratically with the sample sizes.
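The three averages can be computed compactly with NumPy broadcasting. This is just a minimal sketch of my understanding of the formula — the helper name pairwise_dists is my own, not from the papers:

```python
import numpy as np

def pairwise_dists(A, B):
  # matrix of Euclidean distances between every row of A
  # and every row of B, using broadcasting
  diff = A[:, None, :] - B[None, :, :]
  return np.sqrt(np.sum(diff ** 2, axis=2))

def energy_dist(X, Y):
  # energy distance = sqrt(2*avg(Dxy) - avg(Dxx) - avg(Dyy))
  avg_xy = pairwise_dists(X, Y).mean()
  avg_xx = pairwise_dists(X, X).mean()
  avg_yy = pairwise_dists(Y, Y).mean()
  return np.sqrt(2.0 * avg_xy - avg_xx - avg_yy)

X = np.array([[0.10, 0.50, 0.30, 0.10],
              [0.30, 0.60, 0.00, 0.10],
              [0.00, 0.80, 0.05, 0.15]])
Y = np.array([[0.10, 0.30, 0.20, 0.40],
              [0.05, 0.25, 0.40, 0.30]])
print("%0.4f" % energy_dist(X, Y))  # 0.8134
```

Running this on the X and Y samples above reproduces the 0.8134 value.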
I concluded that energy distance is perhaps somewhat of a vanity project — an interesting idea but one that doesn’t appeal to anyone other than the inventor. Many times, new ideas are useful and valid, but they don’t provide a big enough advantage over existing techniques. Maybe energy distance could be useful, but the research papers are written in a style that only deep experts can understand — and so nobody will take the ideas behind energy distance and popularize them for data scientists.
I coded up a demo using Python. But my demo could be quite wrong because I had very little to go on. (Note: my demo is not efficient in the sense that it computes both dist(x,y) and dist(y,x), which are the same for Euclidean distance.)
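For the within-sample averages, the redundancy is easy to remove: because dist(x,y) = dist(y,x) and the diagonal distances are zero, each unordered pair needs to be computed only once and then doubled. A quick sketch — the helper name avg_self_dist is my own:

```python
import numpy as np

def avg_self_dist(A):
  # average of all n*n pairwise Euclidean distances within A,
  # computing each unordered pair only once; the n diagonal
  # terms dist(A[i], A[i]) are zero and contribute nothing
  n = len(A)
  total = 0.0
  for i in range(n):
    for j in range(i + 1, n):
      total += np.sqrt(np.sum((A[i] - A[j]) ** 2))
  return 2.0 * total / (n * n)

X = np.array([[0.10, 0.50, 0.30, 0.10],
              [0.30, 0.60, 0.00, 0.10],
              [0.00, 0.80, 0.05, 0.15]])
print("%0.4f" % avg_self_dist(X))  # 0.2551
```

The cross-sample Dxy matrix has no such symmetry, so all n*m of those distances still have to be computed.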

Vanity Fair Magazine started publication in 1913. The magazine content was folded into Vogue Magazine for several years. Left: The June 1914 issue, just a few weeks before the start of World War I in late July. Center: The July 1929 issue, just a few days before the start of the Great Depression in August 1929. Right: The September 1941 issue, just a few weeks before the U.S. entry into World War II on December 7, 1941.
It seems like people have a remarkable ability to overcome adversity and bounce back stronger than ever with renewed energy.
Code below. Long.
# energy_dist_demo.py
# https://en.wikipedia.org/wiki/Energy_distance
# Szekely, G.J., (2002),
# "E-statistics: The Energy of Statistical Samples"

import numpy as np

def euc_dist(v1, v2):
  return np.sqrt(np.sum((v1 - v2)**2))

def energy_dist(Dxy, Dxx, Dyy):
  n = len(Dxx); m = len(Dyy)

  sumxy = 0.0
  for i in range(n):
    for j in range(m):
      sumxy += Dxy[i][j]
  avg_xy = sumxy / (n * m)

  sumxx = 0.0
  for i in range(n):
    for j in range(n):
      sumxx += Dxx[i][j]
  avg_xx = sumxx / (n * n)

  sumyy = 0.0
  for i in range(m):
    for j in range(m):
      sumyy += Dyy[i][j]
  avg_yy = sumyy / (m * m)

  return np.sqrt( (2 * avg_xy) - avg_xx - avg_yy )

def main():
  print("\nBegin energy distance example ")
  np.set_printoptions(precision=4, suppress=True)

  dim = 4
  X = np.array(
    [[0.10, 0.50, 0.30, 0.10],
     [0.30, 0.60, 0.00, 0.10],
     [0.00, 0.80, 0.05, 0.15]], dtype=np.float32)
  n = len(X)  # 3

  Y = np.array(
    [[0.10, 0.30, 0.20, 0.40],
     [0.05, 0.25, 0.40, 0.30]], dtype=np.float32)
  m = len(Y)  # 2

  print("\nX = "); print(X)
  print("\nY = "); print(Y)

  Dxx = np.zeros((n,n), dtype=np.float32)
  for i in range(n):
    for j in range(n):
      Dxx[i][j] = euc_dist(X[i], X[j])
  print("\nDxx Euclidean distances: ")
  print(Dxx)

  Dyy = np.zeros((m,m), dtype=np.float32)
  for i in range(m):
    for j in range(m):
      Dyy[i][j] = euc_dist(Y[i], Y[j])
  print("\nDyy Euclidean distances: ")
  print(Dyy)

  Dxy = np.zeros((n,m), dtype=np.float32)
  for i in range(n):
    for j in range(m):
      Dxy[i][j] = euc_dist(X[i], Y[j])
  print("\nDxy Euclidean distances: ")
  print(Dxy)

  e_dist = energy_dist(Dxy, Dxx, Dyy)
  print("\nEnergy distance between X and Y: ")
  print("%0.4f" % e_dist)
  print("\nEnd demo ")

if __name__ == "__main__":
  main()

For a brief moment this made me think about Earth Mover Distance
I really admire your effort to try to understand the concept, read them and share with us. I couldn’t agree more with you in regard to the horrible wording of math in Wikipedia. I couldn’t find it less useful. The reader who goes to those wikipedia pages most likely are someone who don’t have deep knowledge for that, however, those wikipedia math pages just try their best to make things as confused as possible. It really makes me feel it’s out of their vanity to write things that way. Thus, ever since then, I rarely donate wikipedia…