
How to check if PDF is scanned image or contains text

I have a large number of PDF files. Some are scanned images saved as PDF, and some are full- or partial-text PDFs.

Is there a way to check these files so that we process only the scanned-image files and skip the full/partial-text ones?

Environment: Python 3.6

asked Apr 16 '19 08:04 by Jinu Joseph




3 Answers

The code below extracts the text layer from a PDF. A searchable PDF returns its text, while a scanned PDF without an OCR layer returns an empty string, which is how the two can be told apart.

import fitz

text = ""
path = "Your_scanned_or_partial_scanned.pdf"

doc = fitz.open(path)
for page in doc:
    text += page.get_text()  # named getText() in older PyMuPDF releases

If you don't have the fitz module, install PyMuPDF, which provides it:

pip install --upgrade pymupdf
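
To turn the extracted text into a scanned/not-scanned decision, one option is to count the characters each page yields and treat near-empty pages as scanned. The sketch below is only an illustration under assumptions and is not part of the original answer: `classify_page_texts`, `is_scanned_pdf`, and the `min_chars` cutoff are hypothetical, and `page.get_text()` assumes a recent PyMuPDF release.

```python
from typing import Iterable, List


def classify_page_texts(page_texts: Iterable[str], min_chars: int = 20) -> List[bool]:
    """Return one flag per page: True if the page looks scanned.

    A page counts as scanned when its extracted text, with all whitespace
    removed, is shorter than min_chars (an arbitrary, tunable cutoff).
    """
    return [len("".join(text.split())) < min_chars for text in page_texts]


def is_scanned_pdf(path: str, min_chars: int = 20) -> bool:
    """True if the PDF at `path` has pages and every page looks scanned."""
    import fitz  # PyMuPDF; imported here so the helper above needs no third-party module

    doc = fitz.open(path)
    try:
        flags = classify_page_texts((page.get_text() for page in doc), min_chars)
    finally:
        doc.close()
    # A document without pages is not reported as scanned.
    return bool(flags) and all(flags)
```

The cutoff exists because scanned pages sometimes carry a few stray characters (a stamp or a page number added later), so requiring an exactly empty string may be too strict.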

answered Oct 23 '22 09:10 by Rahul Agarwal


Building on top of Rahul Agarwal's solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.

You need to install the fitz and PyMuPDF modules, which you can do with pip.

The following code has been tested with Python 3.7.9 and PyMuPDF 1.16.14. It is also important to install fitz BEFORE PyMuPDF; otherwise an odd error about a missing frontend module appears (I have no idea why). So here is how I install the modules:

pip3 install fitz
pip3 install PyMuPDF==1.16.14

And here is the Python 3 implementation:

import fitz


def get_text_percentage(file_name: str) -> float:
    """
    Calculate the fraction (0.0 to 1.0) of the total page area that is
    covered by (searchable) text blocks.

    If the returned fraction is very low, the document is most likely
    a scanned PDF.
    """
    total_page_area = 0.0
    total_text_area = 0.0

    doc = fitz.open(file_name)

    for page in doc:
        # abs() of a fitz.Rect is its area
        total_page_area += abs(page.rect)
        # getTextBlocks() is the PyMuPDF 1.16 name; newer releases use page.get_text("blocks")
        for b in page.getTextBlocks():
            r = fitz.Rect(b[:4])  # rectangle where the block's text appears
            total_text_area += abs(r)
    doc.close()

    if total_page_area == 0:
        return 0.0  # no pages (or zero-area pages): avoid dividing by zero
    return total_text_area / total_page_area


if __name__ == "__main__":
    text_perc = get_text_percentage("my.pdf")
    print(text_perc)
    if text_perc < 0.01:
        print("fully scanned PDF - no relevant text")
    else:
        print("not fully scanned PDF - text is present")

Although this answers your question (i.e. it distinguishes fully scanned PDFs from full/partial-text ones), it cannot distinguish fully textual PDFs from scanned PDFs that also contain text. That is the case for scanned PDFs processed by OCR software (such as pdfsandwich or Adobe Acrobat), which adds "invisible" text blocks on top of the image so that the text can be selected.
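
One way to approach the OCR case described above is to compare image coverage with text coverage on each page: a page almost entirely covered by an image that also carries text is probably a scan with an OCR layer. The following is only a sketch under assumptions, not a tested extension of the code above: `box_area`, `classify_page`, `classify_pdf_page`, and both thresholds are hypothetical, and `page.get_text("dict")` (with image blocks marked `type == 1`) assumes a PyMuPDF release newer than the 1.16.14 pinned earlier.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def box_area(box: Box) -> float:
    """Area of a rectangle; degenerate or inverted boxes count as 0."""
    x0, y0, x1, y1 = box
    return max(x1 - x0, 0.0) * max(y1 - y0, 0.0)


def classify_page(page_area: float,
                  text_boxes: Iterable[Box],
                  image_boxes: Iterable[Box],
                  image_threshold: float = 0.8,
                  text_threshold: float = 0.01) -> str:
    """Label a page from how much of it is covered by images and by text.

    Both thresholds are illustrative guesses. Overlapping boxes are simply
    summed, so the coverage values are approximate.
    """
    if page_area <= 0:
        return "empty"
    text_cov = sum(map(box_area, text_boxes)) / page_area
    image_cov = sum(map(box_area, image_boxes)) / page_area
    has_text = text_cov >= text_threshold
    if image_cov >= image_threshold:
        return "scanned with OCR text" if has_text else "scanned"
    return "text" if has_text else "blank or vector-only"


def classify_pdf_page(page) -> str:
    """Apply classify_page to a PyMuPDF page object."""
    blocks = page.get_text("dict")["blocks"]
    text_boxes = [b["bbox"] for b in blocks if b["type"] == 0]   # text blocks
    image_boxes = [b["bbox"] for b in blocks if b["type"] == 1]  # image blocks
    return classify_page(abs(page.rect), text_boxes, image_boxes)
```

A known weakness of this heuristic is that a digitally produced page with a full-page background image would also be labelled "scanned with OCR text".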

answered Oct 23 '22 09:10 by Vito Gentile


def get_pdf_searchable_pages(fname):
    # Requires pdfminer (pip install pdfminer, or the maintained fork pdfminer.six)
    from pdfminer.pdfpage import PDFPage
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:

        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print("Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")

Output:

Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable
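
Since the goal is to process only the scanned files, it may be more convenient for the check to return a verdict than to print it. A minimal sketch, assuming the same `'Font' in page.resources` heuristic as above: `summarize_pages`, `font_flags`, and `only_scanned` are hypothetical helpers, and a font entry only suggests (it does not prove) that a page has selectable text.

```python
from typing import List


def summarize_pages(has_font: List[bool]) -> str:
    """Reduce per-page 'declares a Font resource' flags to one verdict."""
    if not has_font:
        return "invalid"
    if all(has_font):
        return "searchable"
    if not any(has_font):
        return "non-searchable"
    return "mixed"


def font_flags(fname: str) -> List[bool]:
    """Collect the per-page flags with pdfminer (same check as in the answer)."""
    from pdfminer.pdfpage import PDFPage  # imported lazily; needs pdfminer installed

    with open(fname, "rb") as infile:
        # `page.resources or {}` guards against pages without a resource dictionary.
        return ["Font" in (page.resources or {}) for page in PDFPage.get_pages(infile)]


def only_scanned(fnames: List[str]) -> List[str]:
    """Keep the files in which no page declares a font."""
    return [f for f in fnames if summarize_pages(font_flags(f)) == "non-searchable"]
```

Calling `only_scanned(["1.pdf", "1Scanned.pdf"])` would then give the list of files to send on for processing, with mixed documents left out for separate handling.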
answered Oct 23 '22 10:10 by Vikas Goel