Searching text in a PDF using Python?

Problem
I'm trying to determine what type a document is (e.g. pleading, correspondence, subpoena, etc.) by searching through its text, preferably using Python. All the PDFs are searchable, but I haven't found a way to parse them with Python and run a search script over the text (short of converting each one to a text file first, which could be resource-intensive for n documents).

What I've done so far
I've looked into pypdf, PDFMiner, Adobe's PDF documentation, and every related question here I could find (though none seemed to solve this issue directly). PDFMiner seems to have the most potential, but even after reading through its documentation I'm not sure where to begin.

Is there a simple, effective method for reading PDF text, either by page, line, or the entire document? Or any other workarounds?

611

asked Jun 13 '13 23:06

Insarov

1 Answer

This is called PDF mining, and it is very hard because:

PDF is a document format designed to be printed, not parsed. Inside a PDF document, text is stored in no particular order (unless the order matters for printing), and most of the time the original text structure is lost: letters may not be grouped into words, words may not be grouped into sentences, and the order in which they are placed on the page is often random.
Many different programs generate PDFs, and many of them are defective.

Tools like PDFMiner use heuristics to regroup letters and words based on their position on the page. I agree that the interface is pretty low-level, but it makes more sense once you know what problem it is trying to solve: in the end, what matters is choosing how close a letter, word, or line has to be to its neighbors in order to be considered part of the same paragraph.
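
To make that heuristic concrete, here is a toy sketch in plain Python. It is not PDFMiner's actual algorithm or API: it assumes each glyph arrives as a hypothetical (x, y, width, char) tuple, and the thresholds line_tol and word_gap are made-up names for the "how close is close enough" decisions described above.

```python
from typing import List, Tuple

# Hypothetical glyph record: (x, y, width, char). Real libraries expose
# richer objects; this layout is an assumption made for illustration.
Glyph = Tuple[float, float, float, str]


def group_glyphs(glyphs: List[Glyph], line_tol: float = 2.0,
                 word_gap: float = 1.5) -> List[str]:
    """Rebuild text lines from positioned glyphs.

    Glyphs whose y coordinate is within line_tol of the first glyph of a
    line join that line; a horizontal gap larger than word_gap between
    neighboring glyphs starts a new word.
    """
    lines: List[List[Glyph]] = []
    # PDF y coordinates grow upward, so sort top-to-bottom, then left-to-right.
    for glyph in sorted(glyphs, key=lambda g: (-g[1], g[0])):
        if lines and abs(lines[-1][0][1] - glyph[1]) <= line_tol:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])

    result = []
    for line in lines:
        line.sort(key=lambda g: g[0])  # left-to-right within the line
        text = line[0][3]
        for prev, cur in zip(line, line[1:]):
            gap = cur[0] - (prev[0] + prev[2])
            text += (" " if gap > word_gap else "") + cur[3]
        result.append(text)
    return result
```

Tightening or loosening the two thresholds changes which glyphs end up in the same word or line, which is exactly the kind of tuning that makes results vary from one PDF producer to another.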

An alternative that is expensive in terms of time and computing power is to generate an image of each page and feed it to OCR. This may be worth a try if you have a very good OCR engine.

So my answer is no, there is no such thing as a simple, effective method for extracting text from PDF files. If your documents have a known structure, you can fine-tune the rules and get good results, but it is always a gamble.

I would really like to be proven wrong.

[update]

The answer has not changed, but recently I was involved in two projects: one uses computer vision to extract data from scanned hospital forms, and the other extracts data from court records. What I learned is:

Computer vision is within reach of mere mortals in 2018. If you have a good sample of already classified documents, you can use OpenCV or scikit-image to extract features and train a machine learning classifier to determine what type a document is.
If the PDF you are analyzing is "searchable", you can get very far by extracting all the text with a tool like pdftotext and feeding it to a Bayesian filter (the same kind of algorithm used to classify spam).
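
As a rough illustration of the second point, here is a minimal sketch using only the standard library. It assumes the pdftotext command-line tool is installed and on your PATH. The NaiveBayes class, its method names, and the tokenizer are hypothetical helpers written for this example, not part of any library, and a real project would need far more training data than a toy like this suggests.

```python
import math
import re
import subprocess
from collections import Counter, defaultdict


def pdf_to_text(path: str) -> str:
    """Extract text by calling the pdftotext command-line tool.

    Assumes pdftotext is installed; the "-" argument asks it to write
    the extracted text to standard output instead of a file.
    """
    result = subprocess.run(["pdftotext", path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def tokenize(text: str) -> list:
    """Lowercase the text and keep runs of ASCII letters (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text: str, label: str) -> None:
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text: str) -> str:
        # Words never seen in training carry no information, so skip them.
        words = [w for w in tokenize(text) if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In use, you would call train(pdf_to_text(path), label) for each already classified sample and then classify(pdf_to_text(new_path)) for unknown documents. Note that the whole text never has to be perfectly ordered for this to work, since the classifier only looks at word frequencies.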

So there is no reliable and effective method for extracting text from PDF files, but you may not need one in order to solve the problem at hand (document type classification).

189

answered Sep 17 '22 08:09

Paulo Scardine

Tags:

python

text

parsing

pdf
