Programmatically Extracting Text From a PDF Document Using C#: I Like the PdfPig Library the Best

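One of my work projects involves programmatically extracting the text from a PDF document. This is a surprisingly difficult task because PDF was designed for visual layout: a digital PDF stores text as positioned runs of glyphs rather than as words, lines, and paragraphs in reading order, and a scanned PDF stores only page images with no text at all.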

There are two main kinds of PDF files: digital (high fidelity, produced by an application such as Adobe Acrobat or Microsoft Word) and scanned (crude page images, typically from a scanner, fax, or camera photo).

I did an online search, and the two leading C# libraries appeared to be iText (formerly called iTextSharp and iText 7) and PdfPig. I experimented with both libraries and, overall, I preferred PdfPig for text extraction. For digital PDFs, PdfPig is simpler than iText, but for scanned image-based PDFs, iText worked better for me. Scanned pages have no text layer, so they need OCR, which PdfPig does not provide on its own.
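Because the two kinds of PDF call for different tools, it can help to check which kind a file is before extracting anything. The following is a minimal sketch, not a definitive test. It assumes the PdfPig NuGet package is installed and that a PDF path is passed on the command line. The class name PdfKindSketch and the helper LooksScanned are hypothetical names made up for illustration, and the rule they implement (no letters but at least one image) is only a rough heuristic.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;          // PdfDocument
using UglyToad.PdfPig.Content;  // Page

internal static class PdfKindSketch
{
  // Hypothetical helper (not part of PdfPig): a page with no letters
  // but at least one image is probably a scan that needs OCR.
  public static bool LooksScanned(int letterCount, int imageCount)
  {
    return letterCount == 0 && imageCount > 0;
  }

  public static void Main(string[] args)
  {
    if (args.Length == 0)
    {
      Console.WriteLine("Usage: PdfKindSketch <path-to-pdf>");
      return;
    }

    using (PdfDocument doc = PdfDocument.Open(args[0]))
    {
      foreach (Page page in doc.GetPages())
      {
        int letters = page.Letters.Count;        // glyphs in the text layer
        int images = page.GetImages().Count();   // images placed on the page
        string kind = LooksScanned(letters, images)
          ? "probably scanned (needs OCR)"
          : "not image-only";
        Console.WriteLine("Page " + page.Number + ": " + letters +
          " letters, " + images + " images -> " + kind);
      }
    }
  }
}
```

The heuristic can be fooled. For example, a scanned PDF that already went through OCR carries a hidden text layer, so it will not be flagged as image-only.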

Note: For extracting text from PDF files, I've found the Python PyMuPDF library to be better than either C# library.

I fired up Visual Studio 2022 (free Community Edition) and created a new Console application. After the template loaded, I selected Tools | NuGet Package Manager | Manage NuGet Packages for Solution. I searched the Browse tab for PdfPig and installed it into the Solution.

I used Microsoft Word to create a short dummy PDF document, and saved it in the project root directory. The demo program is:

using System;
using System.IO;
using UglyToad.PdfPig;  // PdfDocument
using UglyToad.PdfPig.Content;  // Page

namespace PdfPigDemo
{
  internal class Program
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin PDF-to-text demo\n");

      string sourcePDF = @"C:\CSharp\PdfPigDemo\dummy.pdf";
      using (PdfDocument doc = PdfDocument.Open(sourcePDF)) 
      {
        foreach (Page page in doc.GetPages())
        {
          string pageText = page.Text;
          Console.WriteLine(pageText);
        }
      }

      Console.WriteLine("\nDone ");
      Console.ReadLine();
    } // Main
  } // class Program
} // ns

Nice and simple.
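One caveat is that page.Text returns the page's letters joined together, so words can run into each other when the PDF does not store explicit space characters. A variation that uses PdfPig's page.GetWords() method is sketched below. It assumes the same dummy.pdf path as the demo unless a path is passed on the command line. The class name WordsSketch and the helper JoinWords are hypothetical names made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;          // PdfDocument
using UglyToad.PdfPig.Content;  // Page, Word

internal static class WordsSketch
{
  // Hypothetical helper (not part of PdfPig): join word strings with
  // single spaces, skipping any empty or whitespace-only entries.
  public static string JoinWords(IEnumerable<string> words)
  {
    return string.Join(" ", words.Where(w => !string.IsNullOrWhiteSpace(w)));
  }

  public static void Main(string[] args)
  {
    // Assumed default: the same path used in the demo program.
    string sourcePDF = args.Length > 0
      ? args[0]
      : @"C:\CSharp\PdfPigDemo\dummy.pdf";

    using (PdfDocument doc = PdfDocument.Open(sourcePDF))
    {
      foreach (Page page in doc.GetPages())
      {
        // GetWords() groups nearby letters into Word objects.
        IEnumerable<string> words = page.GetWords().Select(w => w.Text);
        Console.WriteLine(JoinWords(words));
      }
    }
  }
}
```

This version puts each page on one line and discards line breaks, which is often acceptable as input to a tokenizer but loses paragraph structure.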

Alas, for natural language processing, the C# language just doesn’t have anywhere near as many resources as the Python language. So, for my work NLP projects, this investigation convinced me that I need to use Python instead of C#.

In some sense, PDF is ordinary text wearing a disguise.

Left: One of the earliest TV series was “The Lone Ranger” (1949-1957). I’ve seen a few episodes and the series holds up surprisingly well in terms of entertainment value. The Ranger was Texas Ranger John Reid in disguise. Tonto the Indian sidekick was brave and intelligent.

Right: “Zorro” (1957-1959) was hugely popular. It too holds up very well even by today’s standards. Zorro was rich Don Diego de la Vega in disguise. Faithful servant Bernardo was intelligent and brave when the chips were down.