An Example of Locality-Sensitive Hashing

I was working with differential privacy recently and the topic of locality-sensitive hashing (LSH) came up. The Wikipedia definition of locality-sensitive hashing is: “Locality-sensitive hashing is an algorithmic technique that hashes similar input items into the same buckets with high probability.”

Put another way, an LSH function accepts any kind of input (a numeric vector, a string, a text file, an image, etc.) and returns a single integer (bucket) value in such a way that similar input items return the same bucket value.

Even though it’s possible to write a generic LSH function that handles any kind of input (which requires the raw input to be converted to bytes), LSH functions are often program-defined and specific to different problem scenarios.

Here’s a concrete example of a custom LSH function. Suppose the inputs are numeric 2D vectors where each element is between 0.0 and 10.0, for example [2.0, 3.0] or [6.5, 0.4]. The maximum distance for any input vector to the origin at [0, 0] is sqrt(10^2 + 10^2) = sqrt(200) = 14.1421. And suppose you specify 3 buckets (0, 1, or 2). Define an LSH function as: If the computed distance is between [0.0, 5.0] return bucket 0, if between [5.0, 10.0] return bucket 1, if between [10.0, 15.0] return bucket 2.

With this design inputs of [0.0, 1.0] and [1.5, 1.5] both return bucket 0.

If you think about it, LSH can be thought of as a clustering algorithm where the bucket number is synonymous with cluster ID.

Locality-sensitive hashing is often used with text data which is much more difficult to work with than numeric data because it’s more difficult to compare text data than it is to compare numeric data. But the general principles are the same.

Last night I watched the 2021 movie version of “Dune” based on the 1965 novel by Frank Herbert. A cluster of three book covers. Left: Hardcover first edition (1965). Center: Paperback (1967). Right: Paperback (1984).

Demo code:

# lsh_demo.py

# Wikipedia: locality-sensitive hashing (LSH) is an
# algorithmic technique that hashes similar input items
# into the same "buckets" with high probability

import numpy as np

def lsh_bucket(x, n):
  # x is a 2D vector where each element is in [0.0, 10.0]
  # return is in [0, n-1] -- n buckets
  dist = np.sqrt(x[0]**2 + x[1]**2)  # Euclidean to [0,0]
  # max dist = sqrt(10^2 + 10^2) = sqrt(200) = 14.1421
  delta = 15.0 / n
  for i in range(n):
    if dist < delta * (i+1):
      return i
  return n-1  # 
   
print("\nBegin locality-sensitive hashing (LSH) demo \n")

x = np.array([0.0, 0.0])
bucket = lsh_bucket(x, 4)
print("x = ", end=""); print(x)
print("bucket = %d \n" % bucket)

x = np.array([4.0, 5.0])
bucket = lsh_bucket(x, 4)
print("x = ", end=""); print(x)
print("bucket = %d \n" % bucket)

x = np.array([10.0, 10.0])
bucket = lsh_bucket(x, 4)
print("x = ", end=""); print(x)
print("bucket = %d \n" % bucket)

print("End demo ")