Programmatically Determine if a PDF Document is Scanned or Digital Using Python

Summary: I wrote a Python function to determine if a PDF document is scanned or digital. The function attempts to extract the document text using the PyMuPDF library; if the extraction succeeds, the document is digital, and if it fails, the document is scanned.

One of my work projects involves programmatically extracting the text from thousands of PDF documents. A PDF document can be scanned (typically from a fax or basic copier machine) or digital (typically from an application like Adobe or Word, or from a copier machine that has optical character recognition). It’s useful to know if a PDF is scanned or digital because digital PDF files are much easier to work with than scanned PDF files.

PyMuPDF is my Python library of choice for working with PDF documents. The library has a page.get_text() method that attempts to extract the text from a PDF page. The method only works with digital PDF files. If the method is fed a scanned PDF, the extracted text is either empty or just a few words relative to the size of the source file. Note: PyMuPDF also has a page.get_textpage_ocr() method that does work with scanned PDF files, because it uses optical character recognition.

So, my idea is to just attempt to extract text using the page.get_text() function. If little or no text is extracted, the source PDF file is probably scanned. If a lot of text is extracted, the source PDF is probably digital.
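The decision rule can be looked at separately from the PDF handling. Here is a minimal sketch, assuming the same cutoff as the demo program: fewer extracted characters than 1 percent of the file size in bytes means scanned. The function name looks_scanned, the ratio parameter, and the example numbers are hypothetical, used only for illustration.

```python
# Sketch of the "little or no text" decision rule, separated from
# PDF handling. looks_scanned and ratio are illustrative names;
# the 0.01 default is an assumed cutoff, not a PyMuPDF setting.

def looks_scanned(num_chars, size_in_bytes, ratio=0.01):
  # scanned if no text at all, or very little text relative
  # to the size of the file on disk
  return num_chars == 0 or num_chars < ratio * size_in_bytes

# hypothetical numbers, not measurements from real files
print(looks_scanned(0, 500_000))      # no text extracted
print(looks_scanned(120, 500_000))    # 120 is below the 5,000 cutoff
print(looks_scanned(9_000, 500_000))  # 9,000 is above the 5,000 cutoff
```

The cutoff is a judgment call. A digital PDF that contains large embedded images could have a low text-to-size ratio, so the ratio may need tuning for a particular collection of documents.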

The output of one demo run:

Begin determine if PDF is scanned

Checking if ./scanned_example.pdf is scanned
True
Checking if ./scanned_example.pdf is digital
False

Checking if ./digital_example.pdf is scanned
False
Checking if ./digital_example.pdf is digital
True

End demo

To test my function, I found an example scanned PDF document online, and I created a digital PDF document using Word.

Good fun.

Determining if a PDF file is digital or scanned is a technical problem.

One of my favorite fantasy movies is “The Dark Crystal” (1982). Gelflings Jen (the male in spite of his name) and Kira must travel to the fortress headquarters of the evil Skeksis bird-like creatures and repair the all-powerful crystal to prevent the Skeksis from ruling forever.

I love the many details in the movie.

In an early scene in a swamp, in the background, a small lizard-like creature is chasing an insect-like creature, when the lizard thing is snapped up and eaten by a camouflaged creature that looks like a stonefish from Hell. I always wondered if the creature was a plant or an animal. I recently discovered that the creature is a “Phillite” — an ambush predator animal. I learned this from an online version of the book “The Dark Crystal Bestiary: The Definitive Guide to the Creatures of Thra” by Adam Cesare and Iris Compiet.

Demo program.

# determine_scanned_pdf.py

# programmatically determine if PDF file is scanned or digital
# idea: use PyMuPDF get_text() to attempt to extract text from
#   source PDF. If little to no text is extracted (failure), PDF 
#   is likely scanned. If lots of text is extracted, PDF is
#   probably digital.

# pip install PyMuPDF

import pymupdf  # aka fitz
import os

# -----------------------------------------------------------

def is_pdf_scanned(pdf_path):
  # returns True if scanned, False if digital, None on error
  try:
    size_in_bytes = os.path.getsize(pdf_path)

    # concatenate the extractable text from every page;
    # the with-block closes the document even if a page fails
    txt = ""
    with pymupdf.open(pdf_path) as doc:
      for page in doc:
        txt += page.get_text()

    # if little to no text extracted, PDF is scanned
    # cutoff: fewer characters than 1 percent of the file size
    if len(txt) == 0 or len(txt) < 0.01 * size_in_bytes:
      return True
    else:
      return False  # much text extracted so pdf is digital

  except Exception as e:
    print("Error: " + str(e))
    return None  # could not read the file, so no verdict

# -----------------------------------------------------------

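def is_pdf_digital(pdf_path):
  # pass None (error) through instead of reporting an
  # unreadable file as digital
  scanned = is_pdf_scanned(pdf_path)
  return None if scanned is None else not scanned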

# -----------------------------------------------------------

print("\nBegin determine if PDF is scanned ")

pdf_path = "./scanned_example.pdf"
print("\nChecking if " + str(pdf_path) + " is scanned ")
is_scanned = is_pdf_scanned(pdf_path)
print(is_scanned)
print("Checking if " + str(pdf_path) + " is digital ")
is_digital = is_pdf_digital(pdf_path)
print(is_digital)

pdf_path = "./digital_example.pdf"
print("\nChecking if " + str(pdf_path) + " is scanned ")
is_scanned = is_pdf_scanned(pdf_path)
print(is_scanned)
print("Checking if " + str(pdf_path) + " is digital ")
is_digital = is_pdf_digital(pdf_path)
print(is_digital)

print("\nEnd demo ")