How can I distinguish a digitally-created PDF from a searchable PDF?

Tags:

pdf

I am currently analyzing a set of PDF files. I want to know how many of them fall into each of these three categories:

Digitally Created PDF: The text is there (copyable) and it is guaranteed to be correct as it was created directly e.g. from Word
Image-only PDF: A scanned document
Searchable PDF: A scanned document, but an OCR engine was used. The OCR engine put text "below" the image so that you can search / copy the content. As OCR is pretty good, this is correct most of the time. But it is not guaranteed to be correct.

Image-only PDFs are easy to identify in my domain because every document contains text: if I cannot extract any text, the PDF is image-only. But how do I know whether a PDF is "just" a searchable PDF or a digitally created PDF?

By the way, it is not as simple as just looking at the producer as I have seen scanned documents where the Producer field said "Microsoft Word".

Note: As a human, it is easy. I just zoom in on the text. If I see pixels, it's "just" searchable.

Here are 3 example PDF files to test solutions:

Digitally Created PDF
Scanned PDF: Well.. not really; I used a script to create images and then put them together as a PDF. But that only means that the quality is very good. It should be very similar to a scan.
Searchable PDF

What I tried/thought about

Using the creator/producer: I see "Microsoft Word" in scanned documents. Also this would be tedious.
Embedded fonts: You can extract embedded fonts. The idea was that a scanned document would not have embedded fonts but just use the default. The idea was wrong, as one can see with the example.

301

asked Aug 19 '20 20:08

Martin Thoma

2 Answers

With PyMuPDF you can easily remove all text as is required for @ypnos' suggestion.

As an alternative, PyMuPDF also lets you check whether the text in a PDF is hidden. In the PDF page-description "mini-language", hidden text is produced by the command 3 Tr ("text render mode" 3, meaning the text is neither filled nor stroked; e.g. see page 402 of https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf). So if all text on a page is under the influence of this command, none of it is rendered, which allows the conclusion "this is an OCR'ed page".
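A minimal sketch of that check, assuming you already have the decompressed content stream of a page as bytes. The function name all_text_hidden is hypothetical, and the scan is deliberately crude: it ignores q/Q graphics-state save/restore, does not look inside Form XObjects, and can be fooled by operator-like bytes inside string literals.

```python
import re

# Matches either a text render mode operator ("3 Tr", captured in group 1)
# or a text-showing operator (Tj, TJ, ' or ") right after a string/array.
TOKEN = re.compile(rb"(?<![\d.])(\d)\s+Tr\b|[)\]>]\s*(?:Tj|TJ|'|\x22)")

def all_text_hidden(content: bytes) -> bool:
    """Return True if the stream shows text and all of it uses render mode 3."""
    mode = 0        # The default text render mode is 0 (fill, i.e. visible)
    shown = False   # Becomes True once any text-showing operator is seen
    for match in TOKEN.finditer(content):
        if match.group(1) is not None:
            mode = int(match.group(1))  # A "<n> Tr" operator changes the mode
        else:
            if mode != 3:
                return False            # Text drawn in a visible mode
            shown = True
    return shown                        # No text at all is not "hidden text"
```

With PyMuPDF, the input could come from page.read_contents(), which to my knowledge returns the concatenated content streams of a page; treat that as an assumption and check it against the version you have installed.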

175

answered Nov 09 '22 02:11

Jorj McKie

I adapted this answer from "How to check if PDF is scanned image or contains text".

This solution does not need to render the PDF, so I would guess it is faster. The answer I modified uses the percentage of the page area covered by text to decide whether a document is a text document or a scanned document (image).

I added similar reasoning for images: I sum the area of all images on the page and compute the percentage of the page they cover. If the page is mostly covered by images, you can assume it is a scanned document. You can move the thresholds around to fit your document collection.

I also added logic to check page by page. This is because, at least in the document collection I have, some documents have a digitally created first page while the rest is scanned.

Modified code:

import fitz  # pip install PyMuPDF

def page_type(page):
    """Classify a page as "Scanned", "Searchable text" or "Digitally created"."""

    page_area = abs(page.rect)  # Total page area

    # Proportion of the page covered by images
    img_area = 0.0
    for block in page.get_text("dict")["blocks"]:
        if block["type"] == 1:  # Type=1 are images
            bbox = block["bbox"]
            img_area += (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])  # width*height
    img_perc = img_area / page_area
    print("Image area proportion: " + str(img_perc))

    # Proportion of the page covered by text blocks
    text_area = 0.0
    for b in page.get_text("blocks"):
        if b[6] != 0:  # Block type 0 is text; skip anything else
            continue
        r = fitz.Rect(b[:4])  # Rectangle where block text appears
        text_area += abs(r)
    text_perc = text_area / page_area
    print("Text area proportion: " + str(text_perc))

    if text_perc < 0.01:  # No text = Scanned
        kind = "Scanned"
    elif img_perc > 0.8:  # Has text but very large images = Searchable
        kind = "Searchable text"
    else:
        kind = "Digitally created"
    return kind


doc = fitz.open(pdffilepath)  # pdffilepath: path to the PDF file to analyze

for page in doc:  # Iterate through pages to find different types
    print(page_type(page))
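Since the question asks how many files fall into each category, the per-page labels could be rolled up into one label per document. This is only a sketch: the helper name summarize_pages and the extra labels "Mixed" and "Empty" are my own assumptions, not part of PyMuPDF or of the code above.

```python
from collections import Counter

def summarize_pages(page_labels):
    """Roll per-page labels up into one document-level label plus page counts."""
    counts = Counter(page_labels)
    if not counts:
        return "Empty", counts            # Document without pages
    if len(counts) == 1:
        return next(iter(counts)), counts  # All pages agree on one label
    return "Mixed", counts                 # E.g. digital first page, scanned rest
```

It could be called as summarize_pages(page_type(page) for page in doc). Whether a "Mixed" document should count as scanned or as digitally created depends on your collection.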

answered Nov 09 '22 02:11

Manuel Ruiz
