Example of Calculating the Gower Distance

The Gower distance is a metric that measures the dissimilarity of two items with mixed numeric and non-numeric data. Gower distance is also called Gower dissimilarity. One possible use of Gower distance is with k-means clustering with mixed data because k-means needs the numeric distance between data items.

Briefly, to compute the Gower distance between two items, you compare each pair of corresponding elements and compute a term. If the element is numeric, the term is the absolute value of the difference divided by the range of that element across the dataset. If the element is non-numeric, the term is 0 if the two values are the same and 1 if they are different. The Gower distance is the average of the terms.

Suppose you have four data items where each item is a person. There are six elements: age, race, height, income, IsMale, politic. The elements age, height, and income are numeric. Elements race, IsMale, and politic are non-numeric; race is stored as an integer code but is treated as a category.

      Age  Race  Height  Income  IsMale  Politic
      (n)        (n)     (n)

[1]   22   1     3       0.39    TRUE    moderate
[2]   33   3     1       0.34    TRUE    liberal
[3]   52   1     2       0.51    FALSE   moderate
[4]   46   6     3       0.63    TRUE    conservative

range 30   NA    2       0.29    NA      NA

(n) = numeric element

The distance between person [1] and person [2] is 0.590, calculated like so:

      Age Race  Ht  Inc   Male  Politic
[1] = (22,  1,  3,  0.39, True, moderate)
[2] = (33,  3,  1,  0.34, True, liberal)

numeric: abs(diff) / range 
non-numeric: 0 if equal, 1 if different

dist([1], [2]) =

Age:     abs((22 - 33) / 30)       = 0.367
Race:    (different)               = 1
Height:  abs((3 - 1) / 2)          = 1.000
Inc:     abs((0.39 - 0.34) / 0.29) = 0.172
IsMale:  (same)                    = 0 
Politic: (different)               = 1

 = (0.367 + 1 + 1.000 + 0.172 + 0 + 1) / 6 
 = 3.539 / 6
 = 0.590
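
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name gower_distance and the is_numeric and ranges arguments are names chosen here for the example, and the code assumes every numeric column has a non-zero range.

```python
# Minimal sketch of the Gower distance as described in this post.
# Names (gower_distance, is_numeric, ranges) are illustrative assumptions.

def gower_distance(a, b, is_numeric, ranges):
    """Return the average of the per-element terms for items a and b."""
    terms = []
    for x, y, numeric, rng in zip(a, b, is_numeric, ranges):
        if numeric:
            # numeric: absolute difference divided by the column range
            # (assumes rng is not zero)
            terms.append(abs(x - y) / rng)
        else:
            # non-numeric: 0 if equal, 1 if different
            terms.append(0.0 if x == y else 1.0)
    return sum(terms) / len(terms)

# The four people from the table: age, race, height, income, IsMale, politic
data = [
    (22, 1, 3, 0.39, True, "moderate"),
    (33, 3, 1, 0.34, True, "liberal"),
    (52, 1, 2, 0.51, False, "moderate"),
    (46, 6, 3, 0.63, True, "conservative"),
]
is_numeric = [True, False, True, True, False, False]

# Range (max - min) of each numeric column; None for non-numeric columns
ranges = [
    max(row[j] for row in data) - min(row[j] for row in data)
    if is_numeric[j] else None
    for j in range(len(is_numeric))
]

d12 = gower_distance(data[0], data[1], is_numeric, ranges)
print(f"dist([1], [2]) = {d12:.3f}")  # prints dist([1], [2]) = 0.590
```

Calling the same function on every pair of rows would give the full distance matrix that a clustering algorithm needs.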

Notice that each individual term is between 0.0 and 1.0 inclusive. Therefore the Gower distance is always between 0.0 and 1.0, where a distance of 0.0 means the two items are the same and a distance of 1.0 means the two items are as far apart as possible, relative to the source dataset.

The Gower distance can be used with purely numeric or purely non-numeric data, but for such scenarios there are better distance metrics available.

There are several variations of the Gower distance, so if you encounter it, you should read the documentation carefully. For example, for some scenarios you might want to weight each term to give more/less importance to that term.
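
One way to weight the terms is to replace the plain average with a weighted average. The sketch below assumes that form; the function name weighted_gower_distance and the weight values are hypothetical, chosen only for illustration, and a particular library may define its weights differently.

```python
# Sketch of a weighted Gower distance: a weighted average of the terms.
# The function name and the weight values are hypothetical.

def weighted_gower_distance(a, b, is_numeric, ranges, weights):
    weighted_sum = 0.0
    for x, y, numeric, rng, w in zip(a, b, is_numeric, ranges, weights):
        if numeric:
            term = abs(x - y) / rng            # numeric term
        else:
            term = 0.0 if x == y else 1.0      # non-numeric term
        weighted_sum += w * term
    # Dividing by the sum of the weights keeps the result in [0.0, 1.0]
    return weighted_sum / sum(weights)

# Persons [1] and [2] and the column ranges from the table above
person1 = (22, 1, 3, 0.39, True, "moderate")
person2 = (33, 3, 1, 0.34, True, "liberal")
is_numeric = [True, False, True, True, False, False]
ranges = [30, None, 2, 0.29, None, None]

equal_weights = [1, 1, 1, 1, 1, 1]      # same as the unweighted distance
age_heavy_weights = [2, 1, 1, 1, 1, 1]  # hypothetical: age counts double

d_equal = weighted_gower_distance(person1, person2, is_numeric, ranges, equal_weights)
d_age = weighted_gower_distance(person1, person2, is_numeric, ranges, age_heavy_weights)
print(f"equal weights: {d_equal:.3f}")
print(f"age counted double: {d_age:.3f}")
```

With equal weights the result matches the unweighted distance of 0.590 computed earlier.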

You can’t measure the emotional distance between two people. Three paintings illustrate this, including the famous “Nighthawks” by Edward Hopper (1882 – 1967).


2 Responses to Example of Calculating the Gower Distance

Mary Elizabeth Clinton says:

October 20, 2021 at 8:21 am

Thanks for the post! Do you have a recommendation for purely non-numeric (categorical, ordinal) variables? 🙂

jamesdmccaffrey says:

October 21, 2021 at 6:27 am

This is very tricky. For datasets that contain purely non-numeric data, in most scenarios my colleagues and I use a neural autoencoder. First you convert binary data to 0 or 1 (or possibly -1 or +1) and you convert categorical data to one-hot encoded form (for example, if “color” can be “red”, “blue”, or “green” then red = (1,0,0), blue = (0,1,0), green = (0,0,1)). Then you feed all the data items to an autoencoder, which will create a purely numeric representation of each data item. Then you can use the numeric representations to compute a difference (typically Euclidean distance).

Years ago, I looked at problems like this very closely, for several years. It’s incredibly tricky.

See also https://jamesmccaffrey.wordpress.com/2021/10/04/computing-the-similarity-between-two-machine-learning-datasets-in-visual-studio-magazine/ — for computing the difference between datasets (as opposed to the difference between two items).
