One of my current work projects involves natural language processing. I start with a crude scanned PDF file that arrived as a fax, then use the PyMuPDF library module to programmatically extract the text. The extracted text is very rough. In particular, there are many spelling errors due to the nature of the behind-the-scenes optical character recognition, and many of the words are concatenated (in other words, spaces are dropped).
I initially thought that the OpenAI (aka ChatGPT) API would be able to correct the extracted text, but my experiments were not very successful. So, one weekend, I figured I’d explore creating a lightweight spelling correction tool. I knew going in that the problem was going to be very difficult. And it was.
Here’s the output of my demo:
C:\Python\ChatGPT\SpellCheck: python spell_check_demo.py
Reading source file into memory
Done
Source text:
This isanexample of an "Image-bised PDF" (also known as
image-only PDFs).
Image-based PDFs aretypically created \nthrough scanning
pper in acopier, taknig
photographs or taking screenshots.
Tokenizing source text
Done
Raw tokenized text:
['This', 'isanexample', 'of', 'an', 'Image', 'bised',
'PDF', 'also', 'known', 'as', 'image', 'only', 'PDFs',
'Image', 'based', 'PDFs', 'aretypically', 'created',
'through', 'scanning', 'pper', 'in', 'acopier',
'taknig', 'photographs', 'or', 'taking', 'screenshots']
Possible bad words:
{'pper', 'aretypically', 'isanexample', 'acopier',
'taknig', 'bised'}
Replacing bad words and separating combined words
Done
Corrected text:
This is an example of an "Image-based PDF" (also known as
image-only PDFs). Image-based PDFs atypically created
through scanning paper in copier, taking photographs or
taking screenshots.
C:\Python\ChatGPT\SpellCheck:
You can see there are a few glitches, but overall the spelling correction is good enough for my project purposes.
After reading the source text into memory as a giant string, I use a simple regular expression to tokenize the source string into words. My goal is to keep everything as simple as possible, at the cost of slightly lower accuracy.
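The tokenization step is just one regular-expression call. Here's a minimal sketch of the idea (the sample string is a fragment of the demo text above):

```python
import re

text = 'This isanexample of an "Image-bised PDF"'
# \w+ grabs runs of word characters, so quotes and punctuation drop out
tokens = re.findall(r'\w+', text)
print(tokens)
# ['This', 'isanexample', 'of', 'an', 'Image', 'bised', 'PDF']
```

Note that the hyphen in "Image-bised" acts as a separator, which is why the raw tokenized text above shows 'Image' and 'bised' as separate words.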
I use the pyspellchecker library module to identify bad words, and replace those bad words with the most likely correction. For example, “pper” could easily be “paper” or “piper” or “upper” or “pepper”.
My real source PDF files have hundreds of words, mostly people’s names, that must be manually added to the pyspellchecker vocabulary.
The pyspellchecker cannot handle concatenated words like “isanexample”, so I use the wordninja library module to try to separate them into their parts (“is an example”). I looked at some more complex examples of concatenated words, and wordninja did not handle them well, so I’m skeptical about how useful wordninja will be for my problem scenario.
I’ve worked with NLP systems before and without exception, they are extremely difficult. Complete accuracy is never possible except in artificially small demo scenarios. I’m primarily a mathematician and numbers guy, where accuracy is the rule rather than the exception, and so working with NLP is a bit uncomfortable for me. But for the project I’m working on, it is what it is.
Except for artificially small and simple source text, my short demo isn’t a practical system. But it gave me good insights into the challenges of NLP spell checking. A very difficult problem.

I work best with mathematics and numbers. I sometimes feel uneasy when I work with natural language processing because of the inexact nature and ambiguity of the field. Art is inexact and ambiguous too.
Here are three examples of art by Mitchell Hooks (1923-2013), one of the most famous illustrators of the 1960s. His style is sort of impressionistic and just screams “1960s” to me (in a good way). Hooks did the art for the movie poster for “Dr. No” (1962), the first James Bond movie.
Demo source text:
This isanexample of an “Image-bised PDF” (also known as image-only PDFs). Image-based PDFs aretypically created \nthrough scanning pper in acopier, taknig photographs or taking screenshots.
Demo program:
# spell_check_demo.py
# pip install pyspellchecker
# pip install wordninja
from spellchecker import SpellChecker
import wordninja # combined words
import re # multiple spaces
# 1. prepare spell checker
spell = SpellChecker(case_sensitive=False)
spell.word_frequency.load_words(['pdf', 'pdfs', 'screenshots'])
# 2. load source file
print("\nReading source file into memory ")
fp = "./BadTextFiles/short_bad_example.txt"
with open(fp, "r", encoding="utf-8") as file:
    file_content = file.read()
print("Done ")
print("\nSource text: ")
print(file_content)
# 3. simplify the text string
file_content = re.sub(r'\s+', ' ', file_content).strip()
# file_content = file_content.replace("-", " ")
file_content = file_content.replace("“", "\"")
file_content = file_content.replace("”", "\"")
file_content = file_content.replace(r"\n", "")
# 4. split source string into words
print("\nTokenizing source text ")
# raw = file_content.split(" ")
raw = re.findall(r'\w+', file_content)
print("Done ")
print("\nRaw tokenized text: ")
print(raw)
# 5. find possible misspelled and combined words
possible_bads = spell.unknown(raw)
print("\nPossible bad words: ")
print(possible_bads)
# 6. replace normal misspelled words, then combined
print("\nReplacing bad words and separating combined words ")
for poss_bad in possible_bads:
    correction = spell.correction(poss_bad)
    if correction is not None:
        file_content = \
            file_content.replace(poss_bad, correction)
    else:
        separated = wordninja.split(poss_bad)  # combined words?
        if len(separated) >= 2:
            # deal with combined words
            correction = " ".join(separated)
            file_content = \
                file_content.replace(poss_bad, correction)
print("Done ")
print("\nCorrected text: ")
print(file_content)
