Programmatically Analyzing a PDF File Using the OpenAI API with Python

Unless you’ve been living under a large rock for the past couple of years, you know that the ChatGPT application, based on a GPT-x large language model, is dramatically changing the way millions of people do things. One morning before work, just for hoots, I figured I’d write a program that analyzes a PDF document. How hard could it be?

Well, many hours later, I finally had a short demo up and running. But what was really interesting, and what took me so long, was the massive amount of misleading and out-of-date information I found on the Internet. ChatGPT, and applications based on Gemini (Google), Llama (Meta), Claude (Anthropic) and other LLMs, are changing with astonishing speed, which means that the vast majority of the how-to information about them on the Internet is already out of date.

For example, not so long ago, if you wanted to construct an application that has an LLM conversation, you’d probably use the open source langchain library. But now, if you’re using ChatGPT, it’s dramatically simpler to use the Assistants API. And a few weeks ago (as I write this post), the OpenAI.responses.create() method became the recommended replacement for the OpenAI.chat.completions.create() method. And the “developer” role replaced the “system” role (although the demo program at the end of this post still uses “system”, and it ran without complaint). And on, and on, and on.
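
To make the role change concrete, here is a minimal sketch of how the message list for responses.create() might be assembled with the "developer" role. The helper name build_input and the file ID "file-abc123" are hypothetical, made up for illustration; only the dictionary layout is taken from the demo program at the end of this post. Nothing here calls the API.

```python
# Sketch only: build_input is a hypothetical helper, not part of the openai library.

def build_input(instructions, question, file_id=None):
  """Return a message list shaped for client.responses.create(input=...)."""
  user_content = []
  if file_id is not None:
    # refer to a file that was uploaded earlier with client.files.create()
    user_content.append({"type": "input_file", "file_id": file_id})
  user_content.append({"type": "input_text", "text": question})
  return [
    {"role": "developer", "content": instructions},
    {"role": "user", "content": user_content},
  ]

# "file-abc123" is a placeholder, not a real uploaded file ID
msgs = build_input("You analyze PDF files",
  "What is an interesting fact about Venus?", file_id="file-abc123")
print(msgs[0]["role"])
print(msgs[1]["content"][0]["type"])
```

Swapping "developer" back to "system" in the first dictionary gives the form used in the demo program.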

Anyway, I have an existing OpenAI account and a couple of existing keys for use in programs that call the API. I created a PDF document using the first couple of pages of the Wikipedia entry on the planet Venus. (I copy-pasted the text into Microsoft Word and then saved it as a digital PDF file.)

I wrote my demo using the Python language version of the OpenAI API. I slightly prefer working with Python rather than the JavaScript API when I’m experimenting. I prepped the program by issuing a “pip install openai” command.

Here’s the output of one run of my demo:

C:\Python\ChatGPT\QueryPDF: python pdf_files_chatgpt.py

Begin ChatGPT PDF file demo

Reading file ./venus.pdf

The question is:
What is an interesting fact about Venus?

The answer is:
An interesting fact about Venus is that **a day on Venus
is longer than a year on Venus**!

- **One Venusian day** (the time it takes for Venus to
   rotate once on its axis) is about **243 Earth days**.
- **One Venusian year** (the time it takes for Venus to
   orbit the Sun) is about **224.7 Earth days**.

This means that Venus rotates so slowly that its day is
actually longer than its year! Additionally, Venus

Done

C:\Python\ChatGPT\QueryPDF:

Quite impressive. Notice that the output is truncated because I set max_output_tokens = 100.
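
A program can also detect this kind of truncation instead of relying on a human to spot the cut-off sentence. The sketch below assumes that the object returned by responses.create() exposes a status field and an incomplete_details.reason field, which is my reading of the Responses API and should be checked against the current reference. The function name was_truncated is hypothetical, and the stand-in object is built by hand so that the example runs without calling the API.

```python
from types import SimpleNamespace

def was_truncated(response):
  """True if the response stopped because it hit max_output_tokens."""
  # assumption: the response has .status and .incomplete_details.reason
  details = getattr(response, "incomplete_details", None)
  return (getattr(response, "status", None) == "incomplete"
    and details is not None
    and getattr(details, "reason", None) == "max_output_tokens")

# hand-built stand-in; a real object would come from client.responses.create()
fake = SimpleNamespace(status="incomplete",
  incomplete_details=SimpleNamespace(reason="max_output_tokens"))
print(was_truncated(fake))  # True
```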

I’m not 100% sure I know what’s going on, but my assumption is that behind the scenes, the OpenAI API uses optical character recognition (possibly something like the Tesseract library, though that is a guess on my part) to extract the text from the source PDF file, and then uses the extracted text as the context for the supplied query.

I experimented and found that the demo worked with all three types of PDF files: scanned PDF (essentially an image), scanned-OCR PDF (an unusual format), and digital PDF. Behind the scenes, the API does any OCR and preparation that is necessary, which is a lot of work that had to be done manually as recently as a few weeks ago.

I wanted to make sure that the query-PDF-file program was correctly pulling information from the PDF file, rather than using GPT’s built-in knowledge of Venus that it learned during training (GPT uses the text of Wikipedia for training). I modified the source PDF file to include the fake sentence, “A little-known fact is that the surface of Venus is pink, not green.” And then a paragraph later I added a second fake sentence, “Because Venus soil is pink, Venus is sometimes called the cotton candy planet.” When I reran the query program, the response was:

The answer is:
An interesting fact about Venus is that its surface is
pink, not green. This is a little-known detail mentioned
in your document. Because of its pink soil, Venus is
sometimes called the "cotton candy planet."

So, indeed, the program was pulling from the source PDF document. Very nice.
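
The planted-sentence check can be automated. The sketch below is one simple way to do it, assuming a plain case-insensitive substring match is good enough; the function name mentions_planted_facts is hypothetical. Because the planted facts are false, they cannot have come from training data, so any hit suggests the answer was drawn from the document.

```python
def mentions_planted_facts(answer, planted_phrases):
  """Return the planted phrases that appear in the answer (case-insensitive)."""
  lowered = answer.lower()
  return [p for p in planted_phrases if p.lower() in lowered]

# the answer text is copied from the run shown above
answer = ('An interesting fact about Venus is that its surface is '
  'pink, not green. This is a little-known detail mentioned '
  'in your document. Because of its pink soil, Venus is '
  'sometimes called the "cotton candy planet."')

found = mentions_planted_facts(answer, ["pink", "cotton candy"])
print(found)  # ['pink', 'cotton candy']
```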

I assumed the demo code would also work with ordinary .txt files but, weirdly, it did not, so that will be the topic of a future investigation.
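
One possible workaround for .txt files, offered only as an untested sketch and not as the result of that investigation, is to skip the file upload and place the file's text directly in the request as an input_text item. The helper name build_text_file_input, the "Document:" prefix, and the sample file contents are all hypothetical. The code below only builds the message list and does not call the API.

```python
import os
import tempfile

def build_text_file_input(path, question):
  """Read a text file and place its contents inline ahead of the question."""
  with open(path, "r", encoding="utf-8") as fin:
    text = fin.read()
  return [
    {"role": "system", "content": "You analyze text documents"},
    {"role": "user", "content": [
      {"type": "input_text", "text": "Document:\n" + text},
      {"type": "input_text", "text": question},
    ]},
  ]

# write a tiny throwaway text file so the example is self-contained
with tempfile.TemporaryDirectory() as d:
  path = os.path.join(d, "venus.txt")
  with open(path, "w", encoding="utf-8") as fout:
    fout.write("Venus is the second planet from the Sun.")
  msgs = build_text_file_input(path,
    "What is an interesting fact about Venus?")

print(msgs[1]["content"][0]["text"])
```

The resulting list would be passed as the input argument of client.responses.create(). A long text file could exceed the model's context window, which this sketch does not guard against.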

I reiterate that the main takeaway is that AI via ChatGPT, Claude, and the others is changing with blistering speed, which means that the Internet is littered with out-of-date and irrelevant examples (maybe even this one by the time you’re reading it).

Both AI image generation and AI natural language processing are getting better and better. But there are still glitches.

Left: Image generated by “Give me salmon in a river”.

Center: Image generated by “Give me a wiener dog race”.

Right: Image generated by “Give me a fisher man”.

Demo program:

# pdf_files_chatgpt.py
# query a PDF file

from openai import OpenAI

print("\nBegin ChatGPT PDF file demo ")

key = "sk-proj-_AX7bGTXUwg-qojh2T5Z2CVXrox" + \
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" + \
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" + \
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

fp = "./venus.pdf"  # digital
print("\nReading file " + fp)

client = OpenAI(api_key=key)

# upload the PDF; the with-block closes the local file handle afterwards
with open(fp, "rb") as pdf:
  f = client.files.create(
    file=pdf,
    purpose="user_data",
  )

# question = input("Ask me a question about Venus: \n")
question = "What is an interesting fact about Venus?"
print("\nThe question is: ")
print(question)

response = client.responses.create(
  model = "gpt-4.1",
  input = [
    {
      "role": "system",
      "content": "You analyze PDF files",
    },
    {
      "role": "user",
      "content": [
        { "type": "input_file", "file_id": f.id, },
        { "type": "input_text", "text": question, },
      ]
    },
  ],
  temperature = 0.2,  # not creative
  top_p = 1.0,  # default
  max_output_tokens = 100,
)

print("\nThe answer is: ")
print(response.output_text)

print("\nDone ")
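
The demo hard-codes the API key, which is fine for a quick experiment but easy to leak. A common alternative, sketched below, is to read the key from an environment variable. The helper name get_api_key and the DEMO_VENUS_KEY variable are hypothetical and used only for illustration; OPENAI_API_KEY is the variable name the openai library itself looks for by default.

```python
import os

def get_api_key(env_var="OPENAI_API_KEY"):
  """Return the API key stored in an environment variable."""
  key = os.environ.get(env_var, "")
  if not key:
    raise RuntimeError("Set the " + env_var + " environment variable first")
  return key

# illustration only: a fake key in a made-up variable
os.environ["DEMO_VENUS_KEY"] = "sk-demo-not-a-real-key"
print(get_api_key("DEMO_VENUS_KEY"))

# in the demo program this would replace the hard-coded key:
#   client = OpenAI(api_key=get_api_key())
# and the uploaded PDF could be removed when finished:
#   client.files.delete(f.id)
```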