Converting a PDF File to Images for Later OCR and Text Extraction Using JavaScript

Summary: I used the “pdf-to-img” JavaScript library/package to convert a PDF file to an image, but the library has too many deprecated sub-libraries for pdf-to-img to be useful in anything other than a demo scenario. Not every exploration turns out to be useful, but all explorations add knowledge to a personal skillset.

I rarely post blog articles on my technical failures, but I think this failure is possibly useful information.

I recently devised a system to extract the text from a scanned PDF document, using Python. Note that a scanned PDF document typically comes from a scanner or a fax machine and is usually just a set of page images with no text layer, which makes it nearly impossible to work with, unlike a digital PDF document produced by an application like Adobe Acrobat or Word. My extract-text system uses the PyMuPDF library and the OpenAI API. The PyMuPDF library has built-in OCR (optical character recognition) that, behind the scenes, renders the pages of the source PDF to a set of images and then extracts the text using Tesseract.

Fine. But in a moment of insanity, I decided to refactor my extract-text-from-PDF system from Python to JavaScript (node.js version). How difficult could it be?

Ugh. Hours and hours later, I sort of had a working JavaScript version. Maybe. But not really.

My problems started when I discovered that, unlike the Python PyMuPDF library, the JavaScript MuPDF.js library does not have integrated OCR. The MuPDF.js documentation suggests converting the source PDF document to image(s) first, and then using Tesseract.js OCR to extract the text.

In other words, the MuPDF.js library wasn’t going to help at all. OK, how to convert a PDF document to image(s)? I decided that I wanted to use only JavaScript, as a challenge and as a matter of principle, not as a matter of practicality.

After a lot of time searching the Internet, I stumbled upon the node.js “pdf-to-img” library, which looked promising. After a bit of futzing about, I had a demo up and running. I fed my demo a short scanned PDF document, and successfully converted it to a .png image.

However, when I installed the node.js pdf-to-img library, I saw a slew of NPM warning messages that six of the 40 sub-libraries were deprecated and no longer supported. In other words, it's likely that the pdf-to-img library will stop working at some point in the future. So, if I were working on production code, using the pdf-to-img library would be entirely too risky.
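One way to see exactly which sub-libraries are deprecated after an install is to scan the package-lock.json file that NPM writes. The following is a minimal sketch, not part of the original demo. It assumes a lockfile with lockfileVersion 2 or 3, where the entry for a deprecated package carries a "deprecated" message string. The names list_deprecated.js and findDeprecated are hypothetical.

```javascript
// list_deprecated.js
// usage (assumed): node list_deprecated.js ./package-lock.json

// ----------------------------------------------------------

// lock is a parsed package-lock.json object (assumed v2 or v3)
function findDeprecated(lock)
{
  const results = [];
  for (const [key, info] of Object.entries(lock.packages ?? {})) {
    if (typeof info.deprecated === "string") {
      // keys look like "node_modules/a/node_modules/b";
      // keep only the innermost package name
      const name = key.split("node_modules/").pop();
      results.push({ name: name, version: info.version,
        message: info.deprecated });
    }
  }
  return results;
}

// ----------------------------------------------------------

async function main(lockPath)
{
  const { readFile } = await import("node:fs/promises");
  const lock = JSON.parse(await readFile(lockPath, "utf8"));
  for (const d of findDeprecated(lock)) {
    console.log(`${d.name}@${d.version}: ${d.message}`);
  }
}

// run only when a lockfile path is given on the command line
if (process.argv[2]) {
  main(process.argv[2]).catch((err) => console.error("Error:", err));
}
```

The scan is kept in a separate function that takes an already-parsed object, so it can be checked against a small hand-written lockfile fragment without touching the file system.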

The bottom line is that to convert a PDF document to images for later OCR and text extraction, it's probably best to find a better-supported JavaScript library, or to avoid JavaScript altogether and use a well-supported Python library of some sort. The pdf2image library seems popular, but I've never used it. The PyMuPDF library definitely works well.

In retrospect, it should have been obvious that programmatically converting a PDF to an image using JavaScript would be much more difficult than using Python.

Most social science and psychology research results are pretty obvious. For example, the research paper “Determinants and Consequences of Female Attractiveness and Sexiness” by professor Michael Lynn from Cornell University concluded that waitresses with larger chest size received larger tips from men. Obviously. I speculate this is really just saying that men give attractive waitresses larger tips. Double obvious.

Also obvious was immediate kneejerk academic reactions to the research study, including one by Cornell professor S. Colb. She complained that the research results were discriminatory. Hmm. Professor Colb should not leave academia where people can be offended by facts. Research data indicates that she wouldn’t do well as a waitress.

Strangely, there is some research evidence (not conclusive) which suggests that women customers tip attractive waitresses less than unattractive waitresses. The hypothesis is that women just don’t like women who are more attractive than themselves.

From left to right: Higher-than-average tips waitress (AI), very high tips waitress (AI), professor Colb.

Demo program:

// convert_pdf_to_images.js

// npm install pdf-to-img

import { promises as fs } from "node:fs";
import { pdf } from "pdf-to-img";

// ----------------------------------------------------------

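async function pdfToImages(pdfPath, outDir)
{
  try {
    // create the output directory if it does not exist yet
    await fs.mkdir(outDir, { recursive: true });
    const doc = await pdf(pdfPath, { scale: 2 }); // higher res
    let pageCtr = 1;

    // each iteration yields one page as a PNG buffer
    for await (const buff of doc) {
      const outPath = `${outDir}/page${pageCtr}.png`;
      await fs.writeFile(outPath, buff);
      ++pageCtr;
    }
  } catch (error) {
    console.error("Error:", error);
  }
}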

// ----------------------------------------------------------

async function main()
{
  const pdfPath = "./PDFs/scanned_example.pdf";
  const outDir = "./Images";
  console.log("\nConverting " + pdfPath + " to images ");
  // wait for the conversion so "Done" prints after it finishes
  await pdfToImages(pdfPath, outDir);
  console.log("Done ");
}

main();
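
The demo stops once the .png files are written. The OCR step that the MuPDF.js documentation points to might look something like the following. This is a minimal sketch under stated assumptions, not code from my working system: it assumes Tesseract.js version 5 or later (where createWorker accepts a language code), and it assumes the page1.png, page2.png, ... file names produced by the demo. The names ocr_images.js, sortPageFiles and imagesToText are hypothetical.

```javascript
// ocr_images.js
// npm install tesseract.js   (assumed: version 5 or later)
// usage (assumed): node ocr_images.js ./Images

// ----------------------------------------------------------

// keep only pageN.png files, in numeric page order
// (so page2.png comes before page10.png)
function sortPageFiles(fileNames)
{
  const pattern = /^page(\d+)\.png$/;
  const pageNum = (name) => Number(name.match(pattern)[1]);
  return fileNames
    .filter((name) => pattern.test(name))
    .sort((a, b) => pageNum(a) - pageNum(b));
}

// recognize is any async function: (imagePath) => text
async function imagesToText(imageDir, fileNames, recognize)
{
  const pages = [];
  for (const name of sortPageFiles(fileNames)) {
    pages.push(await recognize(`${imageDir}/${name}`));
  }
  return pages.join("\n");
}

// ----------------------------------------------------------

async function main(imageDir)
{
  const { readdir } = await import("node:fs/promises");
  const { createWorker } = await import("tesseract.js");
  const worker = await createWorker("eng"); // English model
  try {
    const recognize = async (imgPath) =>
      (await worker.recognize(imgPath)).data.text;
    const fileNames = await readdir(imageDir);
    console.log(await imagesToText(imageDir, fileNames, recognize));
  } finally {
    await worker.terminate(); // always release the worker
  }
}

// run only when an image directory is given on the command line
if (process.argv[2]) {
  main(process.argv[2]).catch((err) => console.error("Error:", err));
}
```

Passing the recognize function in as a parameter keeps the page-ordering logic independent of Tesseract.js, so a different OCR engine could be swapped in without changing the rest of the sketch.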