How to check if PDF is scanned image or contains text

3 Answers

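The code below extracts the text layer from a PDF. For a searchable PDF, text ends up holding the document's text; for a fully scanned PDF it typically stays empty or nearly empty, which is what lets you tell the two apart.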

import fitz

text = ""
path = "Your_scanned_or_partial_scanned.pdf"

doc = fitz.open(path)
for page in doc:
    text += page.get_text()  # called getText() in older PyMuPDF releases

If the fitz module is missing, install PyMuPDF, which provides it:

pip install --upgrade pymupdf
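
To turn the extracted text into a per-page answer, one option is to count the characters found on each page. This is only a minimal sketch, assuming a recent PyMuPDF release where page.get_text() is available; the helper names and the min_chars threshold are illustrative and not part of the original answer:

```python
from typing import List


def extract_page_texts(path: str) -> List[str]:
    """Return the text layer of every page (empty string for image-only pages)."""
    import fitz  # provided by the PyMuPDF package; imported lazily on purpose

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


def pages_without_text(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return the 1-based numbers of pages whose text layer is (almost) empty.

    min_chars is an arbitrary, illustrative threshold, not a PyMuPDF setting.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]
```

Calling pages_without_text(extract_page_texts(path)) would then list the pages that look scanned. The threshold probably needs tuning for documents with very short pages, such as title pages.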

answered Oct 23 '22 09:10

Rahul Agarwal

Building on top of Rahul Agarwal's solution, along with some snippets I found at this link, here is a possible algorithm that should solve your problem.

You need to install the fitz and PyMuPDF modules, which you can do with pip.

The following code has been tested with Python 3.7.9 and PyMuPDF 1.16.14. In my setup it was important to install fitz BEFORE PyMuPDF; otherwise I got an odd error about a missing frontend module (no idea why). Here is how I installed the modules:

pip3 install fitz
pip3 install PyMuPDF==1.16.14

And here is the Python 3 implementation:

import fitz


def get_text_percentage(file_name: str) -> float:
    """
    Calculate the fraction (0.0 to 1.0) of the total page area that is
    covered by (searchable) text.

    If the returned value is very low, the document is most likely
    a scanned PDF.
    """
    total_page_area = 0.0
    total_text_area = 0.0

    doc = fitz.open(file_name)

    for page in doc:
        total_page_area += abs(page.rect)  # abs() of a Rect is its area
        # b[:4] is the rectangle in which the block's text appears
        # (getTextBlocks() is the method name in PyMuPDF 1.16.x)
        total_text_area += sum(abs(fitz.Rect(b[:4])) for b in page.getTextBlocks())
    doc.close()
    # Guard against a document without pages to avoid dividing by zero
    return total_text_area / total_page_area if total_page_area else 0.0


if __name__ == "__main__":
    text_perc = get_text_percentage("my.pdf")
    print(text_perc)
    if text_perc < 0.01:
        print("fully scanned PDF - no relevant text")
    else:
        print("not fully scanned PDF - text is present")

Although this answers your question (i.e. it distinguishes between fully scanned and fully or partially textual PDFs), this solution cannot distinguish between fully textual PDFs and scanned PDFs that also contain text (e.g. scanned PDFs processed by OCR software such as pdfsandwich or Adobe Acrobat, which add "invisible" text blocks on top of the image so that you can select the text).
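
One possible way to narrow that gap is to also check whether a single image covers most of the page. What follows is only a heuristic sketch and not a reliable OCR detector. It assumes a recent PyMuPDF release that offers page.get_text("blocks") and page.get_image_info(); the function names, labels, and both thresholds are made up for illustration:

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def box_area(box: Box) -> float:
    """Area of a rectangle; degenerate or inverted boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def classify_page(
    page_area: float,
    text_boxes: Sequence[Box],
    image_boxes: Sequence[Box],
    min_text: float = 0.01,   # illustrative threshold
    min_image: float = 0.8,   # illustrative threshold
) -> str:
    """Label a page by comparing text coverage with its largest image."""
    if page_area <= 0:
        return "unknown"
    text_ratio = sum(box_area(b) for b in text_boxes) / page_area
    image_ratio = max((box_area(b) for b in image_boxes), default=0.0) / page_area
    if image_ratio >= min_image:
        # A page-sized image suggests a scan; text on top suggests an OCR layer.
        return "scanned with OCR text" if text_ratio >= min_text else "scanned"
    return "text" if text_ratio >= min_text else "unknown"


def classify_pdf(path: str) -> List[str]:
    """Return one label per page of the PDF at path."""
    import fitz  # PyMuPDF, imported lazily

    labels = []
    with fitz.open(path) as doc:
        for page in doc:
            # Block tuples end with a block type; 0 marks a text block.
            text_boxes = [tuple(b[:4]) for b in page.get_text("blocks") if b[6] == 0]
            image_boxes = [tuple(info["bbox"]) for info in page.get_image_info()]
            page_area = page.rect.width * page.rect.height
            labels.append(classify_page(page_area, text_boxes, image_boxes))
    return labels
```

A born-digital page with a full-page background image would be mislabelled by this check, so treat the result as a hint rather than a verdict.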

answered Oct 23 '22 09:10

Vito Gentile

def get_pdf_searchable_pages(fname):
    # pip install pdfminer
    from pdfminer.pdfpage import PDFPage
    searchable_pages = []
    non_searchable_pages = []
    page_num = 0
    with open(fname, 'rb') as infile:

        for page in PDFPage.get_pages(infile):
            page_num += 1
            if 'Font' in page.resources.keys():
                searchable_pages.append(page_num)
            else:
                non_searchable_pages.append(page_num)
    if page_num > 0:
        if len(searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is non-searchable")
        elif len(non_searchable_pages) == 0:
            print(f"Document '{fname}' has {page_num} page(s). "
                  f"Complete document is searchable")
        else:
            print(f"searchable_pages : {searchable_pages}")
            print(f"non_searchable_pages : {non_searchable_pages}")
    else:
        print("Not a valid document")


if __name__ == '__main__':
    get_pdf_searchable_pages("1.pdf")
    get_pdf_searchable_pages("1Scanned.pdf")

Output:

Document '1.pdf' has 1 page(s). Complete document is searchable
Document '1Scanned.pdf' has 1 page(s). Complete document is non-searchable
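
If you want to use the result in other code rather than print it, the same font check can return the page lists instead. This is a minimal sketch under the same assumption as above (a page counts as searchable when its resources declare a Font entry); the function names are hypothetical:

```python
from typing import Dict, List


def page_font_flags(fname: str) -> List[bool]:
    """Return one flag per page: True if the page resources declare a font."""
    from pdfminer.pdfpage import PDFPage  # imported lazily

    with open(fname, "rb") as infile:
        return ["Font" in (page.resources or {}) for page in PDFPage.get_pages(infile)]


def split_by_font_flag(has_font: List[bool]) -> Dict[str, List[int]]:
    """Group 1-based page numbers into searchable and non-searchable pages."""
    result: Dict[str, List[int]] = {"searchable": [], "non_searchable": []}
    for page_num, flag in enumerate(has_font, start=1):
        result["searchable" if flag else "non_searchable"].append(page_num)
    return result
```

For example, split_by_font_flag(page_font_flags("1.pdf")) returns both lists in one dictionary. Keep in mind that a scanned page with an OCR text layer also declares a font, so this check reports it as searchable.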

answered Oct 23 '22 10:10

Vikas Goel

Tags: python, python-3.x, pypdf2, pdfminer, pdf-extraction

Jinu Joseph
