Lightweight Sentence Similarity Using GloVe Embeddings

I was annoyed.

It all started one evening when I couldn’t get to sleep. So I decided to code a sentence similarity demo using the new-to-me SentenceTransformers code library. See https://sbert.net/. My first step was to install the library using the command “pip install sentence-transformers”. What could go wrong?

Well, about four hours of aggravation later, my machine's Python-PyTorch installation was messed up beyond repair and I had to uninstall and then reinstall everything. The SentenceTransformers library has a huge number of complicated dependencies that I didn't come close to untangling. Grr. Trying to install SentenceTransformers was an unmitigated disaster.

So, to salvage my pride, I decided to implement the simplest-possible sentence similarity demo, from scratch, using plain Python. The overall idea is simple. If you have two sentences s1 and s2, convert each into a numeric embedding vector, and then compute and return the cosine similarity between the two embedding vectors.
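
As a quick sketch of the math, cosine similarity is just the dot product of the two embedding vectors divided by the product of their norms. Here is a tiny NumPy example where two made-up vectors stand in for sentence embeddings:

import numpy as np

# cosine similarity of two made-up vectors (stand-ins for embeddings)
v1 = np.array([0.2, 0.7, 0.1], dtype=np.float32)
v2 = np.array([0.3, 0.6, 0.4], dtype=np.float32)
cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print("%0.4f" % cos_sim)  # 1.0 = same direction, 0.0 = orthogonal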

After a couple of hours, I had a rudimentary demo up and running and I felt vindicated (but not entirely satisfied). The output of the demo is:

C:\Python\SentenceSimilarity: set PYTHONHASHSEED=1
C:\Python\SentenceSimilarity: python sentence_similarity.py

Begin lightweight sentence similarity using GloVe

s1 = The weather is lovely today
s2 = It's so sunny outside
s3 = He drove to the stadium

Creating sentence similarity object . . .
GloVe parse failure
GloVe parse failure
. . .
GloVe parse failure
Done

Similarity between s1 and s2 = 0.1473
Similarity between s1 and s3 = 0.2839
Similarity between s2 and s3 = 0.1571

End demo

For cosine similarity, values closer to 1.0 mean more similar, and so for this example, sentences s1 and s3 are most similar. This doesn't correspond to human intuition, because s1 and s2 are both related to weather. But this kind of sentence similarity isn't entirely about sentence meaning. As it turns out, s1 and s3 are most similar because they both have the same number of words and I used dummy padding for s2.

I needed a way to convert a sentence from words into a numeric embedding vector. One possibility is to use the gensim Word2Vec library, but this requires a large corpus of training data. Instead, I decided to keep things as simple as possible by using the old GloVe (Global Vectors for Word Representation) pre-trained word embeddings from nlp.stanford.edu/projects/glove/.

There are several GloVe files. I selected the massive “Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03GB download): glove.840B.300d.zip” file. It has 2.2 million words in its vocabulary and each word is mapped to a vector of size 300. I unzipped the download to file “glove.840B.300d.txt”.
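
Each line of the unzipped text file is a term followed by its 300 float values, all separated by spaces. Here is a rough sketch of peeking at the first line (the path is the one used later in the demo; adjust it to wherever you unzipped the file):

import numpy as np

# peek at the first line of the GloVe file -- a term followed by
# 300 space-separated float values
with open(".\\GloveData\\glove.840B.300d.txt", "r",
  encoding="utf-8") as f:
  line = f.readline()
tokens = line.split()
print(tokens[0])                                 # the term
vec = np.asarray(tokens[1:], dtype=np.float32)
print(vec.shape)                                 # (300,)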

Although the idea of sentence similarity is simple, as usual when dealing with natural language processing, there were dozens of very tricky details to attend to.

* When creating a dictionary to map a word like “the” to an embedding vector with 300 values like [0.1234, 0.9876, . . .], some of the GloVe entries were difficult to parse, so I just skipped over them and printed an error message. For example, almost all of the 2.2 million vocabulary items are single words, but there are a handful of terms with spaces, such as “. . .” (three periods with spaces between them), and these generated a parsing error. See the first sketch after this list.

* All of the smaller GloVe versions are uncased, i.e., all lower case, and so when using the small versions of GloVe, the source sentences must be converted to all lower case.

* I used the large, cased version of GloVe (i.e., words in both upper case and lower case) so “March” (the month) and “march” (a kind of walking) are both in the data file. But the huge size means that constructing the embedding dictionary from the data file is painfully slow, about five minutes. I didn't deal with the performance issue and just lived with the slow loading.

* When computing the cosine similarity between two embedding vectors, both vectors must have the same length. This is surprisingly tricky to deal with, so I used the simplest approach of padding the shorter embedding to the size of the larger one, using 0.0 values. But this gives a similarity advantage to sentences with the same number of words. See the second sketch after this list.

* To get reproducible results, I had to set the PYTHONHASHSEED environment variable before running the demo. It can't be set programmatically from inside the script because Python reads it when the interpreter starts. Annoying as all heck.

* My demo doesn’t deal with punctuation — commas, periods, exclamation points, dollar signs, and on and on. Dealing with punctuation is a big subproblem.

My rudimentary sentence similarity demo is simple but it doesn’t take into account word order. To deal with word order, I’d have to use a PyTorch TransformerEncoder module. But that’s a problem for another day.

Anyway, I learned a lot and that’s always satisfying.



Working with natural language processing (NLP) is difficult. [Two images: examples of newspaper NLP gone wrong.]


Demo code.

# sentence_similarity.py
# lightweight sentence similarity using GloVe embeddings
# from https://nlp.stanford.edu/projects/glove/

# before run on Windows, issue "set PYTHONHASHSEED=1" in cmd

import numpy as np

class SentenceSim:
  def __init__(self, glove_path, embed_dim, uncased):
    self.embed_dim = embed_dim
    self.uncased = uncased
    self.embed_dict = {}
    f = open(glove_path, "r", encoding="utf-8") # old school
    for line in f:
      tokens = line.split()
      try:
        word = tokens[0]  # first item on line
        vec = np.asarray(tokens[1:], dtype=np.float32)
      except (IndexError, ValueError):  # blank line or non-numeric value
        print("GloVe parse failure ")
      else:
        self.embed_dict[word] = vec
    f.close()

  # ---------------------------------------------------------

  def similarity(self, s1, s2):
    # convert s1, s2 to lowercase if necessary
    # embed s1 and s2, padded to larger length
    # compute cosine similarity
    if self.uncased:  # must convert to lower case
      s1_processed = s1.lower().replace("\'", "")
      s2_processed = s2.lower().replace("\'", "")
    else:
      s1_processed = s1
      s2_processed = s2

    tokens1 = s1_processed.split()
    tokens2 = s2_processed.split()
    n1 = len(tokens1)
    n2 = len(tokens2)
    n = max(n1, n2)
    embed1 = np.zeros(n * self.embed_dim, dtype=np.float32)
    embed2 = np.zeros(n * self.embed_dim, dtype=np.float32)

    k = 0
    for i in range(n1):
      w = tokens1[i]
      vec = self.embed_dict[w]  # KeyError if w not in GloVe vocab
      for j in range(self.embed_dim):
        embed1[k] = vec[j]
        k += 1
    k = 0
    for i in range(n2):
      w = tokens2[i]
      vec = self.embed_dict[w]
      for j in range(self.embed_dim):
        embed2[k] = vec[j]
        k += 1

    return SentenceSim.cosine_sim(embed1, embed2)

  # ---------------------------------------------------------

  @staticmethod
  def cosine_sim(v1, v2):
    # similarity not distance
    top = np.dot(v1, v2)
    bot = np.linalg.norm(v1) * np.linalg.norm(v2)
    return top / bot

# -----------------------------------------------------------

print("\nBegin lightweight sentence similarity using GloVe ")

s1 = "The weather is lovely today"
s2 = "It's so sunny outside"
s3 = "He drove to the stadium"

print("\ns1 = " + s1)
print("s2 = " + s2)
print("s3 = " + s3)

print("\nCreating sentence similarity object . . .")
ss = SentenceSim(".\\GloveData\\glove.840B.300d.txt",
 300, uncased=False)
# ss = SentenceSim(".\\GloveData\\glove.6B.50d.txt",
#  50, uncased=True)  # smaller version
print("Done ")

s1_s2 = ss.similarity(s1, s2)
print("\nSimilarity between s1 and s2 = %0.4f " % s1_s2)
s1_s3 = ss.similarity(s1, s3)
print("Similarity between s1 and s3 = %0.4f " % s1_s3)
s2_s3 = ss.similarity(s2, s3)
print("Similarity between s2 and s3 = %0.4f " % s2_s3)

print("\nEnd demo ")