I have a PDF that contains tables, text, and some images. I want to extract every table, wherever one appears in the PDF.
Right now I find the tables manually, page by page. When I find one, I capture that page and save it into another PDF.
    import PyPDF2

    PDFfilename = "Sammamish.pdf"  # filename of your PDF/directory where your PDF is stored
    pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb"))  # PdfFileReader object
    pg4 = pfr.getPage(126)  # extract pg 127

    writer = PyPDF2.PdfFileWriter()  # create PdfFileWriter object
    writer.addPage(pg4)  # add pages

    NewPDFfilename = "allTables.pdf"  # filename of your PDF/directory where you want your new PDF to be
    with open(NewPDFfilename, "wb") as outputStream:
        writer.write(outputStream)  # write pages to new PDF
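The code above copies a single page. A minimal sketch of the same idea for several pages is shown below. It assumes the pre-3.0 PyPDF2 method names used in the question (`getNumPages`, `getPage`, `addPage`); the helper name `copy_pages` and the example page list are hypothetical, and the list of pages that contain tables still has to come from somewhere else.

```python
def copy_pages(reader, writer, page_numbers):
    """Copy the given 1-based page numbers from reader into writer.

    Assumes reader and writer expose the pre-3.0 PyPDF2 methods used in the
    question. Page numbers outside the document are skipped.
    Returns the list of page numbers that were actually copied.
    """
    total = reader.getNumPages()
    copied = []
    for number in page_numbers:
        if 1 <= number <= total:
            # PyPDF2 page indices are zero-based, so page 127 is index 126.
            writer.addPage(reader.getPage(number - 1))
            copied.append(number)
    return copied


# Hypothetical usage with the objects from the question:
#   pfr = PyPDF2.PdfFileReader(open("Sammamish.pdf", "rb"))
#   writer = PyPDF2.PdfFileWriter()
#   copy_pages(pfr, writer, [127, 130])
#   with open("allTables.pdf", "wb") as outputStream:
#       writer.write(outputStream)
```

Taking 1-based page numbers keeps the list readable next to a PDF viewer, and the conversion to zero-based indices happens in one place.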
My goal is to extract every table from the whole PDF document.
Step 1: Import all libraries.
Step 2: Convert the PDF file to txt format and read the data.
Step 3: Use the “.findall()” function of regular expressions to extract keywords.
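Step 3 can be sketched as follows. This is only an illustration: `sample_text` is a made-up stand-in for text already converted from a PDF in Step 2, the helper name `find_table_captions` is hypothetical, and the pattern assumes captions of the form "Table N: title", which a real document may not follow.

```python
import re

# Hypothetical sample standing in for text already converted from a PDF (Step 2).
sample_text = """Table 1: Population by year
Some narrative paragraph that mentions a table in passing.
Table 2: Revenue by district
Figure 3: Site map
"""

# Matches lines such as "Table 12: Some title" (assumed caption format).
CAPTION_PATTERN = re.compile(r"^Table[ \t]+(\d+)[:.][ \t]*(.+)$", re.MULTILINE)


def find_table_captions(text):
    """Return a list of (number, title) string pairs for caption-like lines."""
    # With two capture groups, findall returns one tuple per match.
    return CAPTION_PATTERN.findall(text)


for number, title in find_table_captions(sample_text):
    print(number, title)
```

This finds keywords such as table captions, not the table contents themselves, so it is better suited to locating tables than to extracting their cells.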
Here's how you can extract tables from PDFs with Camelot.

    >>> import camelot
    >>> tables = camelot.read_pdf('foo.pdf')
    >>> tables
    <TableList n=1>
    >>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
    >>> tables[0]
    <Table shape=(7, 7)>
    >>> tables[0].
This is my code for extracting tables from a PDF.
    import pandas as pd
    import tabula

    file = "filename.pdf"
    path = 'enter your directory path here' + file
    df = tabula.read_pdf(path, pages='1', multiple_tables=True)
    print(df)
Please refer to this repo of mine for more details.