
How can I distinguish a digitally-created PDF from a searchable PDF?

Tags: python, pdf

I am currently analyzing a set of PDF files. I want to know how many of them fall into each of these 3 categories:

  • Digitally Created PDF: The text is there (copyable) and it is guaranteed to be correct as it was created directly e.g. from Word
  • Image-only PDF: A scanned document
  • Searchable PDF: A scanned document, but an OCR engine was used. The OCR engine put text "below" the image so that you can search / copy the content. As OCR is pretty good, this is correct most of the time. But it is not guaranteed to be correct.

In my domain it is easy to identify image-only PDFs, because every document contains text: if I cannot extract any text, the PDF is image-only. But how do I know whether a PDF is "just" a searchable PDF or a digitally created PDF?
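The image-only check can be sketched as follows. This assumes PyMuPDF (imported as fitz) as the extraction library, which the question does not specify, and both function names are hypothetical.

```python
def has_extractable_text(page_texts):
    """Return True if any page yielded non-whitespace text."""
    return any(text.strip() for text in page_texts)


def is_image_only(pdf_path):
    """Assumption: a PDF without any extractable text is image-only."""
    # Imported lazily so the helper above stays dependency-free.
    import fitz  # pip install PyMuPDF

    with fitz.open(pdf_path) as doc:
        return not has_extractable_text(page.get_text() for page in doc)
```

This only separates image-only files from the other two categories; it says nothing about whether extracted text came from OCR.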

By the way, it is not as simple as just looking at the producer as I have seen scanned documents where the Producer field said "Microsoft Word".

Note: As a human, it is easy. I just zoom in on the text. If I see pixels, it's "just" searchable.

Here are 3 example PDF files to test solutions:

  • Digitally Created PDF
  • Scanned PDF: Well, not really; I used a script to create images and then combined them into a PDF. That only means the quality is very good; it should otherwise be very similar to a scan.
  • Searchable PDF

What I tried/thought about

  • Using the creator/producer: I see "Microsoft Word" in scanned documents. Also this would be tedious.
  • Embedded fonts: You can extract embedded fonts. The idea was that a scanned document would not have embedded fonts but just use the default. The idea was wrong, as one can see with the example.
Martin Thoma, asked Aug 19 '20 20:08




2 Answers

With PyMuPDF you can easily remove all text as is required for @ypnos' suggestion.

As an alternative, with PyMuPDF you can also check whether text is hidden in a PDF. In the PDF content-stream "mini-language", text is hidden by the command 3 Tr ("text render mode", e.g. see page 402 of https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf). So if all text on a page is under the influence of this command, none of it will be rendered, which allows the conclusion "this is an OCR'ed page".
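A minimal sketch of that check, assuming a PyMuPDF release that provides Page.get_texttrace() and assuming that each span's "type" value mirrors the PDF text render mode, so that 3 means invisible text. The function names are hypothetical, and a page without any text is reported as not OCR'ed.

```python
HIDDEN = 3  # assumption: span "type" 3 corresponds to render mode "3 Tr" (invisible)


def all_text_hidden(spans):
    """True if there is at least one character and every character is hidden."""
    total = sum(len(s.get("chars", ())) for s in spans)
    hidden = sum(len(s.get("chars", ())) for s in spans if s.get("type") == HIDDEN)
    return total > 0 and hidden == total


def page_is_ocr_layer(page):
    """Apply the check to a PyMuPDF page (requires Page.get_texttrace)."""
    return all_text_hidden(page.get_texttrace())
```

Counting characters rather than spans keeps one long visible paragraph from being outvoted by many short hidden spans.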

Jorj McKie, answered Nov 09 '22 02:11



I modified this answer from "How to check if PDF is scanned image or contains text".

This solution does not have to render the PDF, so I would guess it is faster. The answer I modified used the percentage of the page area covered by text to decide whether a document is a text document or a scanned document (image).

I added similar reasoning for images: the code sums the area covered by images to get the percentage of the page they cover. If the page is mostly covered by images, you can assume it is a scanned document. You can move the threshold around to fit your document collection.

I also added logic to check page by page. This is because at least in the document collection I have, some documents might have a digitally created first page and then the rest is scanned.

Modified code:

import fitz  # pip install PyMuPDF

def page_type(page):
    """Classify one page as "Scanned", "Searchable text" or "Digitally created"."""
    page_area = abs(page.rect)  # abs() of a fitz.Rect is its area

    # Share of the page covered by images ("dict" output includes image blocks).
    # get_text replaces the older getText / getTextBlocks names.
    img_area = 0.0
    for block in page.get_text("dict")["blocks"]:
        if block["type"] == 1:  # type 1 = image block
            img_area += abs(fitz.Rect(block["bbox"]))
    img_perc = img_area / page_area
    print("Image area proportion: " + str(img_perc))

    # Share of the page covered by text blocks
    text_area = 0.0
    for b in page.get_text("blocks"):
        if b[6] == 0:  # block_type 0 = text; skip any image blocks
            text_area += abs(fitz.Rect(b[:4]))
    text_perc = text_area / page_area
    print("Text area proportion: " + str(text_perc))

    if text_perc < 0.01:  # (Almost) no text = scanned
        kind = "Scanned"
    elif img_perc > 0.8:  # Has text but very large images = searchable
        kind = "Searchable text"
    else:
        kind = "Digitally created"
    return kind


pdffilepath = "example.pdf"  # placeholder: path to the PDF to analyze
doc = fitz.open(pdffilepath)

for page in doc:  # Iterate through pages to find different types
    print(page_type(page))
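Because the result is per page, counting files per category needs one more step. A sketch, assuming the three labels returned by page_type above; the "Mixed" and "Empty" labels and the function name are hypothetical additions.

```python
from collections import Counter


def document_type(page_types):
    """Collapse per-page labels into one document label plus per-label counts."""
    counts = Counter(page_types)
    if not counts:
        return "Empty", counts  # no pages at all
    if len(counts) == 1:
        return next(iter(counts)), counts  # every page agrees
    return "Mixed", counts  # e.g. digital first page, scanned rest
```

It could be called as label, counts = document_type(page_type(p) for p in doc); the counts let you apply a majority rule instead of "Mixed" if that fits your collection better.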
Manuel Ruiz, answered Nov 09 '22 02:11
