How to extract text from a Specific Area in a PDF using Python?

I'm trying to extract text from a PDF using Python, and I have successfully done so using PyPDF2 like this:

from PyPDF2 import PdfFileReader

reader = PdfFileReader('path.pdf')
page = reader.getPage(0)   # first page of the document
text = page.extractText()  # all text on that page
print(text)

This extracts all the text from the page, but I want to extract the text only from a rectangular region of 3" x 4" at the top-left of the page.

I basically want to do something like "How-to extract text from a pdf doc within a specific rectangular region?", but in Python.

Can this be done with PyPDF2 or with any other Python library?

asked Aug 21 '17 07:08

Devdatta Tengshe

1 Answer

This is a rather complex topic, but it is possible. First you need to get familiar with the PDF format description.

Start here for example.

You can identify the location and contents of the text boxes and extract the string data.
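Once you have the text fragments together with their positions, picking out a rectangular region is just a coordinate comparison. Here is a minimal sketch of that step, assuming the fragments are already available as (x, y, text) tuples in PDF points (1/72 inch) with the origin at the bottom-left of the page. The function names and the sample fragments are made up for illustration and do not come from PyPDF2 or any other library.

```python
# Sketch only: the fragment list below is made-up sample data, not the
# output of any particular library.
POINTS_PER_INCH = 72  # default PDF user space: 1 unit = 1/72 inch


def top_left_region(page_height, width_in, height_in):
    """Return (x0, y0, x1, y1) in points for a box at the page's top-left.

    PDF coordinates start at the bottom-left corner, so the top edge of
    the box sits at y = page_height.
    """
    x0 = 0.0
    x1 = float(width_in * POINTS_PER_INCH)
    y1 = float(page_height)
    y0 = y1 - height_in * POINTS_PER_INCH
    return (x0, y0, x1, y1)


def text_in_region(fragments, region):
    """Join the text of all fragments whose origin lies inside region."""
    x0, y0, x1, y1 = region
    inside = [(x, y, text) for x, y, text in fragments
              if x0 <= x <= x1 and y0 <= y <= y1]
    # Read top to bottom (descending y), then left to right.
    inside.sort(key=lambda f: (-f[1], f[0]))
    return " ".join(text for _, _, text in inside)


# Hypothetical fragments on a US Letter page (612 x 792 points).
fragments = [
    (72.0, 750.0, "Invoice"),
    (72.0, 720.0, "No. 1234"),
    (400.0, 750.0, "Page 1"),      # too far right
    (72.0, 300.0, "Footer text"),  # too far down
]
region = top_left_region(792, 3, 4)
print(region)
print(text_in_region(fragments, region))
```

Note that this only tests the starting point of each fragment, so a fragment that begins inside the box but runs past its right edge is still included. Whether that is acceptable depends on your documents.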

This topic holds examples for pyPdf, the predecessor of PyPDF2, but the syntax is similar. There are examples of how to iterate through the indirect objects.

A good place to start is also the source of the function pageObj.extractText() that you used.
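To give an idea of what such a function has to do: roughly, it walks the page's content stream and collects the operands of the text-showing operators. Below is a heavily simplified sketch of that idea which also tracks the position set by the Tm and Td operators. It is not the PyPDF2 implementation. It assumes an already decompressed, whitespace-separated content stream, literal strings without escapes or nested parentheses, unscaled text matrices, and only the Tj operator (no TJ arrays, hex strings or font encodings). The function name and the sample stream are hypothetical.

```python
import re

# A token is either a literal string "(...)" without nested or escaped
# parentheses, or a run of non-space characters (numbers, names such as
# /F1, operators such as Tj).
TOKEN = re.compile(r"\(([^()\\]*)\)|(\S+)")


def positioned_text(content):
    """Yield (x, y, text) for each Tj operator in a simplified stream."""
    operands = []
    line_x = line_y = 0.0  # translation part of the text line matrix
    for match in TOKEN.finditer(content):
        literal, word = match.groups()
        if literal is not None:
            operands.append(literal)
            continue
        try:
            operands.append(float(word))
            continue
        except ValueError:
            pass
        if word.startswith("/"):
            operands.append(word)  # a name operand, e.g. /F1
            continue
        # Anything else is treated as an operator.
        if word == "BT":
            line_x = line_y = 0.0  # a text object starts at the origin
        elif word == "Tm" and len(operands) >= 6:
            # a b c d e f Tm: only the translation (e, f) is used here.
            line_x, line_y = operands[-2], operands[-1]
        elif word == "Td" and len(operands) >= 2:
            # tx ty Td: move relative to the start of the current line.
            line_x += operands[-2]
            line_y += operands[-1]
        elif word == "Tj" and operands:
            yield (line_x, line_y, operands[-1])
        operands = []  # operands belong to exactly one operator


# Made-up content stream for illustration.
sample = """
BT
/F1 12 Tf
1 0 0 1 72 750 Tm
(Invoice) Tj
0 -30 Td
(No. 1234) Tj
ET
BT
/F1 12 Tf
1 0 0 1 400 750 Tm
(Page 1) Tj
ET
"""
for item in positioned_text(sample):
    print(item)
```

The sketch does not advance the position by the width of the glyphs shown, so two Tj operators on the same line would report the same x. A real extractor also has to handle the stream filters and font encodings described in the PDF specification.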

If you are not restricted to Python, see: How to extract text from a PDF?

You can also use a tool like iText RUPS to inspect the PDF. It shows how the content is rendered and placed on the page.

Afterwards you should be able to identify and address the elements and extract their content.

answered Oct 14 '22 02:10

Joe

Tags:

python

pdf

python-2.7

pypdf2
