I am currently analyzing a set of PDF files. I want to know how many of them fall into each of these 3 categories: digitally created PDFs, image-only PDFs, and searchable PDFs.
It is easy to identify image-only PDFs in my domain, as every PDF contains text: if I cannot extract any text, it is image-only. But how do I know whether a PDF is "just" a searchable PDF or a digitally created PDF?
By the way, it is not as simple as just looking at the producer as I have seen scanned documents where the Producer field said "Microsoft Word".
Note: As a human, it is easy. I just zoom in on the text. If I see pixels, it's "just" searchable.
Here are 3 example PDF files to test solutions:
Text-based PDF: Normally, you create the file in your software and then "print" it to a PDF printer, which converts the file to PDF format. These PDF files are text-based, meaning that they retain the text and formatting of the original. Text-based PDF files are searchable because they contain real text.
After opening the PDF, try searching for a word known to be in the document (preferably a word that appears on several different pages) by pressing Ctrl+F and entering the word in the Find box. If the viewer reports no matches, the document is not text-searchable.
Searchable PDFs usually result through the application of OCR (Optical Character Recognition) to scanned PDFs or other image-based documents. During the text recognition process, characters and the document structure are analyzed and “read”. A text layer is added to the image layer, usually placed underneath.
Not all PDF files are searchable. If a PDF file is created with PDF editor software, it contains text elements in the page content streams. But if a PDF is created by scanning a text document, no character or text information is available.
Easily search through long PDF files with a lot of text. Learn how to make scanned PDF documents searchable. Once you scan a paper document into a PDF file, you may notice that you can’t search the text. Your scanner captures the pages as a flat image, which means there’s no text your PDF viewer can recognize.
With PyMuPDF you can easily remove all text as is required for @ypnos' suggestion.
As an alternative, with PyMuPDF you can also check whether text is hidden in a PDF. In PDF's relevant "mini-language" this is triggered by the command 3 Tr ("text render mode", e.g. see page 402 of https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf).
So if all text is under the influence of this command, then none of it will be rendered - allowing the conclusion "this is an OCR'ed page".
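A minimal sketch of that check, working on the raw bytes of a page's content stream. The function name `hidden_text_ratio` is hypothetical, and the parsing is deliberately rough: it only looks at the `Tr`, `Tj` and `TJ` operators and ignores graphics-state save/restore (`q`/`Q`), so treat the result as a heuristic rather than a full PDF interpreter.

```python
import re

# Literal strings "(...)" and hex strings "<...>" are blanked out first so that
# page text is never mistaken for an operator. Nested parentheses are not handled.
_STRINGS = re.compile(rb"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>", re.S)
# Either "<n> Tr" (set text render mode) or a text-showing operator.
_OPS = re.compile(rb"(\d+)\s+Tr\b|\b(Tj|TJ)\b")


def hidden_text_ratio(stream: bytes) -> float:
    """Return the share of text-showing operators drawn in render mode 3 (invisible)."""
    cleaned = _STRINGS.sub(b" ", stream)
    mode = 0  # PDF default render mode: fill text (visible)
    shown = hidden = 0
    for match in _OPS.finditer(cleaned):
        if match.group(1) is not None:
            mode = int(match.group(1))  # a new render mode takes effect
        else:
            shown += 1
            if mode == 3:
                hidden += 1
    return hidden / shown if shown else 0.0


# Synthetic content stream resembling an OCR text layer (assumed example data).
ocr_like = b"BT /F1 12 Tf 3 Tr (Hello) Tj ET"
print(hidden_text_ratio(ocr_like))
```

With PyMuPDF, the bytes for a page should be obtainable via `page.read_contents()`. A ratio close to 1 would then suggest an OCR'ed page, while a ratio of 0 on a page with text suggests digitally created text.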
I modified the answer from "How to check if PDF is scanned image or contains text".
In this solution you don't have to render the PDF, so I would guess it is faster. Basically, the answer I modified used the percentage of the page area covered by text to determine whether it is a text document or a scanned document (image).
I added similar reasoning for images: I sum the area covered by images and compute the percentage of the page they cover. If the page is mostly covered by images, you can assume it is a scanned document. You can move the threshold around to fit your document collection.
I also added logic to check page by page. This is because at least in the document collection I have, some documents might have a digitally created first page and then the rest is scanned.
Modified code:
import fitz  # pip install PyMuPDF

def page_type(page):
    """Classify one page by the share of its area covered by images and by text."""
    page_area = abs(page.rect)  # Total page area

    img_area = 0.0
    for block in page.get_text("dict")["blocks"]:
        if block["type"] == 1:  # Type 1 blocks are images
            bbox = block["bbox"]
            img_area += (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])  # width * height
    img_perc = img_area / page_area
    print("Image area proportion: " + str(img_perc))

    text_area = 0.0
    for b in page.get_text("blocks"):
        if b[6] == 0:  # Count text blocks only (block type 0)
            text_area += abs(fitz.Rect(b[:4]))  # Area of the rectangle where the block text appears
    text_perc = text_area / page_area
    print("Text area proportion: " + str(text_perc))

    if text_perc < 0.01:  # No text = scanned
        return "Scanned"
    if img_perc > 0.8:  # Has text but very large images = searchable
        return "Searchable text"
    return "Digitally created"

pdffilepath = "example.pdf"  # Placeholder: path to the PDF to inspect
doc = fitz.open(pdffilepath)
for page in doc:  # Iterate through pages to find different types
    print(page_type(page))
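Since the question asks how many files fall into each category, the per-page labels still need to be collapsed into one label per document. Here is a rough sketch of one way to do that; the helper names `document_type` and `tally` and the extra labels "Mixed" and "Empty" are my own assumptions, not part of the answer above.

```python
from collections import Counter


def document_type(page_labels):
    """Collapse per-page labels into a single label for the whole document."""
    distinct = set(page_labels)
    if not distinct:
        return "Empty"  # assumed label for a document without pages
    if len(distinct) == 1:
        return distinct.pop()  # every page agrees
    return "Mixed"  # assumed label, e.g. digital cover page followed by scans


def tally(documents):
    """Count documents per label; `documents` maps a file name to its page labels."""
    return Counter(document_type(labels) for labels in documents.values())


# Hypothetical per-page results, as they might come out of a page classifier.
results = {
    "a.pdf": ["Digitally created", "Digitally created"],
    "b.pdf": ["Scanned"],
    "c.pdf": ["Digitally created", "Searchable text", "Searchable text"],
}
print(tally(results))
```

Whether a mixed document should count as its own category or be assigned to the majority label depends on your collection, so adjust `document_type` accordingly.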