
Use pdfplumber to find text in PDF, return page number, then return table

Tags:

python-3.x

pdf

I downloaded 42 PDFs which are each formatted similarly. Each has various tables, one of which is labeled "Campus Reported Incidents." That particular table is on a different page in each PDF. I want to write a function that will search for the page that has "Campus Reported Incidents" and scrape that table so that I can put it into a dataframe.

I figured that I could use PDFPlumber to search for the string "Campus Reported Incidents" and return the page number. I would then write a function that uses the page number to scrape the table I want, and I would loop that function through every PDF. However, I keep on getting the error "argument is not iterable" or "type object is not subscriptable." I looked through the PDFPlumber documentation but it didn't help my problem.

Here is one example of code that I tried:

url = "pdfs/example.pdf"

import pdfplumber

pdf = pdfplumber.open(url)

for page in range[0:len(pdf.pages)]:
    if 'Total number of physical restraints' in pdf.pages[page]:
        print(pdf.page_number)
user3710004 asked May 01 '19 17:05



1 Answer

I see this post was from a while ago but maybe this response will still help you or someone else.

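The "type object is not subscriptable" error comes from the way you are looping through the pages. range has to be called with parentheses, as in range(len(pdf.pages)); writing range[0:len(pdf.pages)] tries to subscript the range type itself. The "argument is not iterable" error comes from the "in" test: a page object is not a string, so you would need to search the text returned by pdf.pages[page].extract_text() instead. Note also that page_number is an attribute of each page, not of the pdf object. Rather than building a range at all, try to enumerate through the pages. The "i" will give you access to the index (the current count in the loop). The "pg" will give you access to the page object in the PDF pages. I didn't use the "pg" variable below, but you could use that instead of "pages[i]" if you want.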

The code below should print the tables from each page, as well as give you access to the tables to manipulate them further.

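import pdfplumber

pdf_file = "pdfs/example.pdf"
tables = []  # collects the tables found on every page
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for i, pg in enumerate(pages):
        # extract_tables() returns a list of tables, each a list of rows
        tbl = pages[i].extract_tables()
        tables.append(tbl)
        print(f'{i} --- {tbl}')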
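
To get closer to what the question asks for (find the page that mentions "Campus Reported Incidents", then pull the tables from that page), something along these lines might work. This is only a sketch: the names find_table_page and scrape_pdf are made up for illustration, and it assumes the label can be found in the output of extract_text(), which may not hold for scanned PDFs.

```python
def find_table_page(pages, needle):
    """Return (page_number, tables) for the first page whose text contains needle.

    `pages` is any sequence of objects offering extract_text() and
    extract_tables(), such as pdfplumber's pdf.pages.
    """
    for number, page in enumerate(pages, start=1):
        # Guard against extract_text() giving back None for an empty page.
        text = page.extract_text() or ""
        if needle in text:
            return number, page.extract_tables()
    # Nothing matched: no page number and no tables.
    return None, []


def scrape_pdf(path, needle="Campus Reported Incidents"):
    """Hypothetical wrapper: open one PDF and search its pages."""
    import pdfplumber  # imported here so find_table_page works without it

    with pdfplumber.open(path) as pdf:
        return find_table_page(pdf.pages, needle)
```

You could then call scrape_pdf on each of the 42 files in a loop. The page may hold more than one table, so you would still have to pick the right one out of the returned list. If the first row of that table holds the column names, pd.DataFrame(table[1:], columns=table[0]) is one way to turn it into a dataframe.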
TheAccountemist answered Nov 02 '22 22:11