Use pdfplumber to find text in PDF, return page number, then return table

Tags: python-3.x, pdf

I downloaded 42 PDFs which are each formatted similarly. Each has various tables, one of which is labeled "Campus Reported Incidents." That particular table is on a different page in each PDF. I want to write a function that will search for the page that has "Campus Reported Incidents" and scrape that table so that I can put it into a dataframe.

I figured that I could use pdfplumber to search for the string "Campus Reported Incidents" and return the page number. I would then write a function that uses the page number to scrape the table I want, and I would loop that function over every PDF. However, I keep getting the error "argument is not iterable" or "type object is not subscriptable." I looked through the pdfplumber documentation, but it didn't solve my problem.

Here is one example of code that I tried:

url = "pdfs/example.pdf"

import pdfplumber

pdf = pdfplumber.open(url)

for page in range[0:len(pdf.pages)]:
    if 'Total number of physical restraints' in pdf.pages[page]:
        print(pdf.page_number)

asked May 01 '19 17:05

user3710004

1 Answer

I see this post was from a while ago but maybe this response will still help you or someone else.

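The errors are coming from the way you are looping through the pages. Writing range[0:len(pdf.pages)] with square brackets tries to subscript the range type itself instead of calling it, which is why you're seeing the "type object is not subscriptable" message; the call would need to be range(len(pdf.pages)). The "argument is not iterable" message comes from the next line: a page object can't be searched with "in", so the check has to run against the string returned by pdf.pages[page].extract_text(). Note also that page_number belongs to each page, not to the pdf object. Instead of building indices yourself, try enumerate() to loop through the pages. The "i" will give you access to the index (the current count in the loop, starting at 0). The "pg" will give you access to the page object in the PDF pages. I didn't use the "pg" variable below, but you could use that instead of "pages[i]" if you want.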
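Both error messages can be reproduced without a PDF. This is only a minimal sketch: PageStandIn is a hypothetical class that stands in for a page object (it holds text but is not a container), and the exact wording of the TypeError messages can differ between Python versions.

```python
class PageStandIn:
    """Hypothetical stand-in for a page: holds text, but is not iterable."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


errors = []

try:
    range[0:3]  # square brackets subscript the range type itself
except TypeError as exc:
    errors.append(type(exc).__name__)

page = PageStandIn("Campus Reported Incidents")

try:
    "Campus" in page  # `in` needs a container or iterable on the right
except TypeError as exc:
    errors.append(type(exc).__name__)

# The working equivalents: call range(), and search the extracted string.
indices = list(range(3))
found = "Campus" in page.extract_text()
```

After this runs, errors holds two TypeError entries, one per failing line, while the last two lines succeed.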

The code below should print the tables from each page, as well as give you access to the tables to manipulate them further.

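import pdfplumber

pdf_file = "pdfs/example.pdf"
tables = []  # collects the tables found on each page
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    # enumerate yields the index and the page object together
    for i, pg in enumerate(pages):
        # extract_tables() returns a list of tables; each table is a list of rows
        tbl = pages[i].extract_tables()  # pg.extract_tables() is equivalent
        tables.append(tbl)
        print(f'{i} --- {tbl}')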
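To get closer to what the question asks for (find the page that contains the label, then pull the table from that page), something along these lines might work. It is a sketch under assumptions, not a tested solution for these particular PDFs: find_page_with_text and incidents_table are hypothetical helper names, the label is assumed to be extractable as plain text on the same page as the table, and extract_table() returns the largest table on the page, which may or may not be the labeled one.

```python
def find_page_with_text(pages, needle):
    """Return the first page whose extracted text contains needle, else None."""
    for page in pages:
        # Fall back to "" in case a page yields no text at all.
        text = page.extract_text() or ""
        if needle in text:
            return page
    return None


def incidents_table(pdf_path, needle="Campus Reported Incidents"):
    """Return (page_number, table_rows) for the first matching page.

    Returns (None, None) when no page contains the needle.
    """
    # Imported here so find_page_with_text can be used without pdfplumber.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = find_page_with_text(pdf.pages, needle)
        if page is None:
            return None, None
        # page_number is 1-based; extract_table() gives the largest table
        # on the page (assumption: that is the table you want).
        return page.page_number, page.extract_table()
```

Calling incidents_table("pdfs/example.pdf") in a loop over the 42 files would give one page number and one list of rows per PDF. If the first row of each table holds the column headers (an assumption about these documents), the rows could then be passed to pandas as pd.DataFrame(rows[1:], columns=rows[0]). Keep in mind that extract_table() returns None when it finds no table on the page.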

answered Nov 02 '22 22:11

TheAccountemist
