
How can I process a pdf using OpenAI's APIs (GPTs)?

The web interface for ChatGPT has an easy PDF upload. Is there an API from OpenAI that can receive PDFs?

I know there are third-party libraries that can read PDFs, but given that a PDF contains images and other important information, it might be better if a model like GPT-4 Turbo were fed the actual PDF directly.

I'll state my use case to add more context. I intend to do RAG. In the code below I handle the PDF and a prompt. Normally I'd append the text at the end of the prompt. I could still do that with a PDF if I extract its contents manually.
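For reference, the append-the-text approach I mean is something like this sketch. It assumes the PDF text has already been extracted (e.g. with a third-party library such as pypdf); `build_rag_prompt` is just an illustrative helper name, not an OpenAI API:

```python
# Sketch of the usual RAG prompt assembly, assuming the PDF text has
# already been extracted by some other means.
def build_rag_prompt(question: str, context: str) -> str:
    """Append retrieved document text to the user's question."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt("What is the invoice total?", "Invoice total: $42.00")
```

The resulting string would then be sent as the user message of a normal chat completion request.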

The following code is taken from here https://platform.openai.com/docs/assistants/tools/code-interpreter. Is this how I'm supposed to do it?

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a file with an "assistants" purpose
file = client.files.create(
  file=open("example.pdf", "rb"),
  purpose='assistants'
)

# Create an assistant using the file ID
assistant = client.beta.assistants.create(
  instructions="You are a personal math tutor. When asked a math question, write and run code to answer the question.",
  model="gpt-4-1106-preview",
  tools=[{"type": "code_interpreter"}],
  file_ids=[file.id]
)

There is an upload endpoint as well, but it seems the intent of those endpoints is fine-tuning and assistants. I think the RAG use case is a common one and not necessarily tied to assistants.

asked Sep 01 '25 by Muhammad Mubashirullah Durrani

2 Answers

May 2025 edit: according to the official guide, OpenAI GPT-4.1 can extract the content of (or answer questions about) an input PDF file foobar.pdf stored locally, with a solution along the lines of:

from openai import OpenAI
import os

filename = "foobar.pdf"
prompt = """Extract the content from the file provided without altering it.
            Just output its exact content and nothing else."""

client = OpenAI(api_key=os.environ.get("MY_OPENAI_KEY"))

file = client.files.create(
    file=open(filename, "rb"),
    purpose="user_data"
)

response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_file",
                    "file_id": file.id,
                },
                {
                    "type": "input_text",
                    "text": prompt,
                },
            ]
        }
    ]
)

print(response.output_text)

The prompt can of course be replaced with the desired user request; I assume the OpenAI key is stored in an env var named MY_OPENAI_KEY.

P.S. I have edited the answer as this approach is much more streamlined w.r.t. the assistants-based 2024 solution that you can see in the edit history, heavily inspired by https://medium.com/@erik-kokalj/effectively-analyze-pdfs-with-gpt-4o-api-378bd0f6be03

answered Sep 03 '25 by Davide Fiocco

One solution: convert the PDF to images and feed them to the vision model as multi-image inputs https://platform.openai.com/docs/guides/vision.

GPT-4 with vision is not a different model that does worse at text tasks because it has vision; it is simply GPT-4 with vision added.

Since it's the same model with vision capabilities, this should be sufficient for both text and image analysis.

You could also extract the images from the PDF and feed them separately, making a multi-model architecture. I prefer the first option. Ideally, experiments should be run to see which produces better results.

Text only + images only VS Images (containing both)

PDF-to-image conversion can be done locally in Python, as can separating images from a PDF. It isn't a difficult task requiring support from someone like OpenAI.
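As a sketch of the multi-image request: assuming each page has already been rendered to a base64-encoded PNG (e.g. with pdf2image or PyMuPDF), the chat payload can be assembled like this. `build_vision_messages` is an illustrative helper, not part of the OpenAI SDK:

```python
# Builds a Chat Completions `messages` list that sends a text prompt plus
# one image_url entry per rendered PDF page. Rendering the pages to
# base64 PNG strings is assumed to have happened already.
def build_vision_messages(prompt: str, pages_b64: list[str]) -> list[dict]:
    content = [{"type": "text", "text": prompt}]
    for b64 in pages_b64:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return [{"role": "user", "content": content}]

# Placeholder base64 string standing in for a real rendered page.
messages = build_vision_messages("Summarize this document.", ["iVBORw0KGgo="])
# The call itself would then be e.g.:
# client.chat.completions.create(model="gpt-4o", messages=messages)
```

Data-URL images are the documented way to pass local images to the vision endpoint, so no file upload step is needed with this approach.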

answered Sep 03 '25 by Muhammad Mubashirullah Durrani