I have a large number of PDF files. Some are scanned images saved as PDF, and some are full or partial text PDFs.
Is there a way to check these files so that we only process the scanned images and skip the full or partial text PDFs?
Environment: Python 3.6
Start the Adobe® Acrobat® application and use the "File > Open..." menu to open a scanned PDF document. Select "Tools" from the main toolbar. Double-click the "Enhance Scans" tool. Expand the "Recognize Text" pull-down menu.
Use "dtsearch" to create an index for all the pdf files... then "view the log file" of the indexing process to check the list of pdf files that were not indexed.
You can generally tell visually whether a document is scanned by enlarging it on your screen and looking closely at the text. Viewed up close, a scanned image will appear to have much poorer resolution than an electronically created PDF document.
The code below extracts the text layer from a PDF. It runs on both searchable and non-searchable PDFs; for a scanned PDF with no text layer, the extracted text is empty or nearly empty.
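import fitz  # provided by the PyMuPDF package

text = ""
path = "Your_scanned_or_partial_scanned.pdf"
doc = fitz.open(path)
for page in doc:
    # get_text() is the current name; older PyMuPDF releases call it getText()
    text += page.get_text()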
If you don't have the fitz module, you need to do this:
pip install --upgrade pymupdf
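To turn the extracted text into a scanned / not scanned decision, one option is to count the non-whitespace characters on each page. The sketch below is only an illustration under assumptions: the names `classify_page_texts` and `read_page_texts` and the `min_chars` cutoff of 20 are hypothetical, and `page.get_text()` assumes a recent PyMuPDF release.

```python
def classify_page_texts(page_texts, min_chars=20):
    """Label a document from the text extracted from each of its pages.

    min_chars is an arbitrary (assumed) cutoff: a page with fewer
    non-whitespace characters is treated as having no text layer.
    """
    if not page_texts:
        return "empty"
    has_text = [
        sum(1 for ch in text if not ch.isspace()) >= min_chars
        for text in page_texts
    ]
    if all(has_text):
        return "text"
    if not any(has_text):
        return "scanned"
    return "mixed"


def read_page_texts(path):
    """Return one extracted-text string per page (requires PyMuPDF)."""
    import fitz  # imported here so classify_page_texts works without it

    doc = fitz.open(path)
    try:
        return [page.get_text() for page in doc]
    finally:
        doc.close()
```

A call such as `classify_page_texts(read_page_texts("my.pdf"))` would then return "text", "scanned", "mixed", or "empty". The cutoff likely needs tuning, since scans can carry a few stray characters such as stamped page numbers.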
Building on top of Rahul Agarwal's solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.
You need to install the fitz and PyMuPDF modules. You can do so with pip.
The following code has been tested with Python 3.7.9 and PyMuPDF 1.16.14. Moreover, it is important to install fitz BEFORE PyMuPDF, otherwise it produces a weird error about a missing frontend module (no idea why). So here is how I install the modules:
pip3 install fitz
pip3 install PyMuPDF==1.16.14
And here is the Python 3 implementation:
import fitz


def get_text_percentage(file_name: str) -> float:
    """
    Calculate the percentage of document that is covered by (searchable) text.

    If the returned percentage of text is very low, the document is
    most likely a scanned PDF
    """
    total_page_area = 0.0
    total_text_area = 0.0

    doc = fitz.open(file_name)

    for page_num, page in enumerate(doc):
        total_page_area = total_page_area + abs(page.rect)
        text_area = 0.0
        for b in page.getTextBlocks():
            r = fitz.Rect(b[:4])  # rectangle where block text appears
            text_area = text_area + abs(r)
        total_text_area = total_text_area + text_area
    doc.close()
    return total_text_area / total_page_area


if __name__ == "__main__":
    text_perc = get_text_percentage("my.pdf")
    print(text_perc)
    if text_perc < 0.01:
        print("fully scanned PDF - no relevant text")
    else:
        print("not fully scanned PDF - text is present")
Although this answers your question (i.e. it distinguishes between fully scanned and full/partial textual PDFs), this solution cannot distinguish between full-textual PDFs and scanned PDFs that also have text within them. This is the case for scanned PDFs processed by OCR software (such as pdfsandwich or Adobe Acrobat), which adds "invisible" text blocks on top of the image so that you can select the text.
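One possible way around that limitation is to also look at how much of each page is covered by an image: a page that has text blocks but is almost entirely covered by a single image is probably a scan with an OCR layer. The sketch below is a heuristic of my own, not something from the answer above. It works on plain (x0, y0, x1, y1) rectangles, so extracting the text and image rectangles from the PDF is left to whichever library you use; the function names and the 0.01 / 0.9 thresholds are assumptions.

```python
def rect_area(rect):
    """Area of an (x0, y0, x1, y1) rectangle; degenerate rectangles count as 0."""
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def classify_page(page_rect, text_rects, image_rects,
                  min_text_ratio=0.01, min_image_ratio=0.9):
    """Classify one page from its text-block and image rectangles.

    Both thresholds are assumed values and would need tuning on real files.
    """
    page_area = rect_area(page_rect)
    if page_area == 0:
        return "empty"

    text_ratio = sum(rect_area(r) for r in text_rects) / page_area
    # Use only the largest image, so overlapping images are not double-counted.
    largest_image = max((rect_area(r) for r in image_rects), default=0.0)
    image_ratio = largest_image / page_area

    has_text = text_ratio >= min_text_ratio
    full_page_image = image_ratio >= min_image_ratio

    if full_page_image and has_text:
        return "scanned with OCR text"
    if full_page_image:
        return "scanned"
    if has_text:
        return "text"
    return "unknown"
```

For an A4 page of 595 x 842 points, `classify_page((0, 0, 595, 842), [(50, 50, 500, 700)], [(0, 0, 595, 842)])` returns "scanned with OCR text". A genuine text PDF with a full-page background image would be misclassified, so treat the result as a hint rather than a verdict.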
def get_pdf_searchable_pages(fname):
    # pip install pdfminer
    from pdfminer.pdfpage import PDFPage
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:

        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print(f"Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")
Output:
Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable
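Since the question is about a large number of files, a batch filter might look like the sketch below. `select_scanned_pdfs` and the `is_scanned` callable are hypothetical names: `is_scanned` stands for whichever per-file check you settle on (for example, one that returns True when no page has a 'Font' resource). Only the standard library is used here.

```python
from pathlib import Path


def select_scanned_pdfs(folder, is_scanned):
    """Split the PDFs under `folder` into scanned files and failures.

    is_scanned is any callable that takes a path string and returns a bool.
    Returns (scanned, failed): `scanned` is a sorted list of Path objects,
    `failed` is a list of (Path, exception) pairs for files that raised.
    """
    scanned, failed = [], []
    # Note: the "*.pdf" pattern is case-sensitive on most non-Windows systems.
    for pdf_path in sorted(Path(folder).rglob("*.pdf")):
        try:
            if is_scanned(str(pdf_path)):
                scanned.append(pdf_path)
        except Exception as exc:  # a corrupt or encrypted file should not stop the run
            failed.append((pdf_path, exc))
    return scanned, failed
```

Keeping the failures in a separate list means unreadable files can be reviewed by hand instead of silently being treated as text PDFs.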