I downloaded 42 PDFs which are each formatted similarly. Each has various tables, one of which is labeled "Campus Reported Incidents." That particular table is on a different page in each PDF. I want to write a function that will search for the page that has "Campus Reported Incidents" and scrape that table so that I can put it into a dataframe.
I figured that I could use PDFPlumber to search for the string "Campus Reported Incidents" and return the page number. I would then write a function that uses the page number to scrape the table I want, and I would loop that function through every PDF. However, I keep on getting the error "argument is not iterable" or "type object is not subscriptable." I looked through the PDFPlumber documentation but it didn't help my problem.
Here is one example of code that I tried:
url = "pdfs/example.pdf"
import pdfplumber
pdf = pdfplumber.open(url)
for page in range[0:len(pdf.pages)]:
    if 'Total number of physical restraints' in pdf.pages[page]:
        print(pdf.page_number)
I see this post was from a while ago but maybe this response will still help you or someone else.
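The errors look like they're coming from the way you are looping through the pages. Writing range[0:len(pdf.pages)] with square brackets tries to subscript the range type itself instead of calling it, which is why you're seeing the "type object is not subscriptable" error message; the call would be range(len(pdf.pages)). The "argument is not iterable" error most likely comes from using "in" directly on a page object: a page is not a string, so you would need to pull its text out with extract_text() before searching it. Instead of indexing with range, try to enumerate() through the pages. The "i" will give you access to the index (aka the current count in the loop). The "pg" will give you access to the page object in the PDF pages. I didn't use the "pg" variable below, but you could use that instead of "pages[i]" if you want.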
The code below should print the tables from each page, as well as give you access to the tables to manipulate them further.
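import pdfplumber
pdf_file = "pdfs/example.pdf"
tables = []  # collects the tables found on each page
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for i, pg in enumerate(pages):
        # extract_tables() returns a list of tables; each table is a list of rows
        tbl = pages[i].extract_tables()
        tables.append(tbl)
        print(f'{i} --- {tbl}')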
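If you only want the one table from the question, here is a rough sketch of how the search could work. The function names (find_table_page, scrape_labeled_table, scrape_all) are made up for illustration and are not part of pdfplumber. It also assumes that the label shows up in the text returned by extract_text(), that the table you want is the first one extract_tables() finds on that page, and that its first row is the header. Any of those may need adjusting for your PDFs.

```python
from pathlib import Path


def find_table_page(pages, label):
    """Return the first page whose extracted text contains `label`, or None."""
    for pg in pages:
        # extract_text() can return None or an empty string for pages without text
        text = pg.extract_text() or ""
        if label in text:
            return pg
    return None


def scrape_labeled_table(pdf_path, label="Campus Reported Incidents"):
    """Return the labeled table as a DataFrame, or None if it is not found."""
    # Imported here so find_table_page() can be used without these packages
    import pdfplumber
    import pandas as pd

    with pdfplumber.open(pdf_path) as pdf:
        pg = find_table_page(pdf.pages, label)
        if pg is None:
            return None
        tables = pg.extract_tables()
        if not tables:
            return None
        # Assumption: the wanted table is the first one on the page,
        # and its first row holds the column names.
        header, *rows = tables[0]
        return pd.DataFrame(rows, columns=header)


def scrape_all(folder="pdfs"):
    """Run scrape_labeled_table() on every PDF in `folder`, keyed by file name."""
    return {p.name: scrape_labeled_table(p) for p in sorted(Path(folder).glob("*.pdf"))}
```

Calling scrape_all("pdfs") should then give you a dictionary with one entry per PDF, with None for any file where the label or a table was not found. If the page holds more than one table, you would have to pick the right one out of the list from extract_tables() yourself, for example by checking each table's header row.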