I was reading a research paper recently and it mentioned the Earth Mover’s Distance, also known as the Wasserstein metric or Wasserstein distance. I’ll refer to it as the Wasserstein metric. The Wasserstein metric is a measure of the difference between two distributions. If two distributions are identical, their Wasserstein metric is zero. The more different two distributions are, the larger the value of the Wasserstein metric. If one of two distributions is considered ground truth, the Wasserstein metric can be interpreted as a measure of error.
A couple of years ago I wrote up a blog post that showed an example calculation of the Wasserstein metric for two different distributions where the X values had two dimensions, such as (3,4). The research paper I was reading had X values that are just single values. So, I decided to write up this post, another example calculation, this time with 1-D X data.
The calculation is best explained by the graphs below. The top distribution is the “dirt” distribution. The bottom distribution is the “holes” distribution. Both distributions are probability distributions because the areas of the bars in each sum to 1. The Wasserstein metric is the amount of work needed to transfer the dirt to the holes. Work is the amount of dirt moved (the “flow”) times the distance moved.
I labeled the bars in the top dirt distribution as A, B, C, D just to reference them. The bars in the bottom holes distribution are labeled R, S, T. The Wasserstein distance metric is 2.20, calculated as follows:
step   from   to   flow   dist   work
 1.     A     R    0.2     1     0.20
 2.     B     R    0.1     0     0.00
 3.     C     R    0.2     3     0.60
 4.     C     S    0.1     2     0.20
 5.     D     S    0.2     4     0.80
 6.     D     T    0.2     2     0.40
                   ----          -----
                   1.0           2.20

Wasserstein distance = total work / total flow
                     = 2.20 / 1.0
                     = 2.20
1. all 0.2 in A is moved to R, using up A, with R needing 0.3 more.
2. all 0.1 in B is moved to R, using up B, with R needing 0.2 more.
3. just 0.2 in C is moved to R, filling R, leaving 0.1 left in C.
4. all remaining 0.1 in C is moved to S, using up C, with S needing 0.2 more.
5. 0.2 in D is moved to S, filling S, leaving 0.2 left in D.
6. all remaining 0.2 in D is moved to T, using up D, filling T.
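The six steps above can be checked with a few lines of Python. The (flow, distance) pairs come straight from the table; summing flow times distance reproduces the 2.20 result:

```python
# each step of the transport plan as a (flow, distance) pair,
# taken directly from the table above
plan = [(0.2, 1),   # A -> R
        (0.1, 0),   # B -> R
        (0.2, 3),   # C -> R
        (0.1, 2),   # C -> S
        (0.2, 4),   # D -> S
        (0.2, 2)]   # D -> T

total_flow = sum(f for f, _ in plan)      # total dirt moved, sums to 1.0
total_work = sum(f * d for f, d in plan)  # total work, sums to 2.20
wasserstein = total_work / total_flow

print(wasserstein)  # 2.2 (up to floating point)
```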
The Wasserstein metric is nicely symmetric in the sense that if you reverse the meaning of which distribution is “dirt” and which is “holes”, you get the same result.

The scipy library has a wasserstein_distance function (in scipy.stats).
Update: The scipy wasserstein_distance default-parameter interface requires you to awkwardly specify the empirical sample values rather than a pair of frequency distributions, but I discovered you can use a work-around call to specify distributions:
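A sketch of that work-around call: the optional u_weights and v_weights parameters of scipy.stats.wasserstein_distance accept frequency distributions directly. The bar positions below are my reconstruction, chosen to be consistent with the distance column of the table (the post’s graphs define the actual positions):

```python
from scipy.stats import wasserstein_distance

# bar positions: reconstructed to agree with the distances
# in the table above; the post's graphs show the real ones
dirt_pos   = [0, 1, 4, 10]        # A, B, C, D
dirt_freq  = [0.2, 0.1, 0.3, 0.4]
holes_pos  = [1, 6, 8]            # R, S, T
holes_freq = [0.5, 0.3, 0.2]

d = wasserstein_distance(dirt_pos, holes_pos,
                         u_weights=dirt_freq,
                         v_weights=holes_freq)
print(d)  # 2.2 (up to floating point)
```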
There are actually different versions of the Wasserstein metric. The Earth Mover’s Distance is Wasserstein with p = 1, usually denoted as W1 or 1-Wasserstein. The 2-Wasserstein metric is computed like 1-Wasserstein, except that each flow is multiplied by the squared distance, the results are summed, and then you take the square root. For 3-Wasserstein you multiply each flow by the cubed distance and take the cube root of the sum, and so on. And you can compute the Wasserstein metric on discrete distributions (such as the example shown in this post) and also on continuous distributions.
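A minimal sketch of p-Wasserstein for discrete 1-D distributions: sort both distributions by position, greedily match mass left to right (which is optimal in one dimension), accumulate flow times distance raised to the p-th power, then take the p-th root. The bar positions in the demo call are hypothetical, chosen only to be consistent with the flow/distance table in this post:

```python
def wasserstein_p(x_vals, x_wts, y_vals, y_wts, p=1):
    """p-Wasserstein between two discrete 1-D distributions whose
    weights each sum to 1. Greedy left-to-right matching of mass."""
    xs = sorted(zip(x_vals, x_wts))
    ys = sorted(zip(y_vals, y_wts))
    i = j = 0
    cx, cy = xs[0][1], ys[0][1]   # mass remaining in current bars
    total = 0.0
    while i < len(xs) and j < len(ys):
        m = min(cx, cy)                             # mass moved this step
        total += m * abs(xs[i][0] - ys[j][0]) ** p  # flow * dist^p
        cx -= m
        cy -= m
        if cx == 0:
            i += 1
            cx = xs[i][1] if i < len(xs) else 0.0
        if cy == 0:
            j += 1
            cy = ys[j][1] if j < len(ys) else 0.0
    return total ** (1.0 / p)

# hypothetical bar positions, consistent with the table in this post
w1 = wasserstein_p([0, 1, 4, 10], [0.2, 0.1, 0.3, 0.4],
                   [1, 6, 8], [0.5, 0.3, 0.2], p=1)
w2 = wasserstein_p([0, 1, 4, 10], [0.2, 0.1, 0.3, 0.4],
                   [1, 6, 8], [0.5, 0.3, 0.2], p=2)
print(w1)  # about 2.2
print(w2)  # sqrt(6.4), about 2.53
```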
There are many other ways to measure the difference, or similarity, between two probability distributions, including the Kullback–Leibler divergence and the Hellinger distance.

You can’t measure the emotional distance between two people, but I like to think that the distance between two people in love is really close to zero.


Hi,
Given this from Wikipedia’s entry on the EMD,
“The above definition is valid only if the two distributions have the same integral (informally, if the two piles have the same amount of dirt), as in normalized histograms or probability density functions. ”
I would’ve expected either that you’d need to normalize both “dirt” and “holes”, or at least that they would have to sum to the same value (the “same amount of dirt”). The scipy page didn’t enlighten me.
What am I missing?
Thanks in advance!
You are correct that dirt and holes have to sum to the same value. The scipy function will normalize for you by converting to frequencies so that all values sum to 1.0.
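That normalization can be verified directly: passing unnormalized weights (here, the frequencies scaled by 10) gives the same result as the normalized ones. The bar positions are hypothetical, chosen only to be consistent with the table in the post:

```python
from scipy.stats import wasserstein_distance

dirt_pos, holes_pos = [0, 1, 4, 10], [1, 6, 8]

# normalized frequencies (each set sums to 1.0)
d1 = wasserstein_distance(dirt_pos, holes_pos,
                          u_weights=[0.2, 0.1, 0.3, 0.4],
                          v_weights=[0.5, 0.3, 0.2])

# same weights scaled by 10 -- scipy normalizes them internally
d2 = wasserstein_distance(dirt_pos, holes_pos,
                          u_weights=[2, 1, 3, 4],
                          v_weights=[5, 3, 2])

print(d1, d2)  # both about 2.2
```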