
How to extract table as text from the PDF using Python?

I have a PDF that contains tables, text, and some images. I want to extract every table, wherever it appears in the PDF.

Right now I find the tables manually, page by page. When I find one, I copy that page and save it into another PDF.

    import PyPDF2

    PDFfilename = "Sammamish.pdf"  # filename of your PDF/directory where your PDF is stored

    pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb"))  # PdfFileReader object

    pg4 = pfr.getPage(126)  # extract pg 127

    writer = PyPDF2.PdfFileWriter()  # create PdfFileWriter object
    # add pages
    writer.addPage(pg4)

    NewPDFfilename = "allTables.pdf"  # filename of your PDF/directory where you want your new PDF to be
    with open(NewPDFfilename, "wb") as outputStream:
        writer.write(outputStream)  # write pages to new PDF
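A note for readers on a newer install: `PdfFileReader`, `getPage`, and `addPage` were removed in PyPDF2 3.0, so the snippet above may fail there. A minimal sketch of the same page-copy step with the newer `PdfReader`/`PdfWriter` names, assuming PyPDF2 2.x or later, could look like the following. The helper names `to_zero_based` and `copy_pages` are hypothetical and not part of the original question.

```python
def to_zero_based(page_numbers, page_count):
    """Convert 1-based page numbers to 0-based indices, dropping out-of-range ones."""
    return [n - 1 for n in page_numbers if 1 <= n <= page_count]


def copy_pages(src_path, dst_path, page_numbers):
    """Copy the given 1-based pages from src_path into a new PDF at dst_path."""
    # Imported here so the helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in to_zero_based(page_numbers, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as output_stream:
        writer.write(output_stream)


# Example call (hypothetical): copy page 127 into a new file.
# copy_pages("Sammamish.pdf", "allTables.pdf", [127])
```

This still requires knowing the page numbers in advance, so it only tidies up the manual approach; it does not locate tables.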

My goal is to extract the tables from the whole PDF document.

Please have a look at the sample image of a page in the PDF.

venkat asked Nov 28 '17 at 14:11





1 Answer

  • I suggest extracting the tables with tabula (the tabula-py package).
  • Pass the path of your PDF to tabula.read_pdf, and it returns the tables it finds as pandas DataFrames.
  • Each table in the PDF is returned as one DataFrame.
  • With multiple_tables=True, the result is a list of DataFrames; you need pandas to work with them.

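This is my code for extracting tables from a PDF.

    import os

    import tabula

    file = "filename.pdf"
    # Join directory and file name so the path separator is not lost
    path = os.path.join("enter your directory path here", file)
    # With multiple_tables=True, read_pdf returns a list of DataFrames (here, the tables on page 1)
    df = tabula.read_pdf(path, pages="1", multiple_tables=True)
    print(df)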

Please refer to this repo of mine for more details.
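Since the question asks for tables from the whole document rather than a single page, here is a minimal sketch that passes `pages="all"` to `tabula.read_pdf` and writes each detected table to its own CSV file. It assumes tabula-py and a Java runtime are installed. The helper names `csv_name` and `extract_all_tables` and the output naming scheme are hypothetical, chosen only for illustration.

```python
import os


def csv_name(pdf_path, index):
    """Build an output file name such as 'Sammamish_table_3.csv' (illustrative scheme)."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"{stem}_table_{index}.csv"


def extract_all_tables(pdf_path, out_dir="."):
    """Read every page of the PDF and save each non-empty table as a CSV file."""
    # Imported here because tabula-py needs a Java runtime to be available.
    import tabula

    # pages="all" scans the whole document; the result is a list of DataFrames.
    tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
    written = []
    for i, table in enumerate(tables, start=1):
        if table.empty:
            continue  # skip detections that contain no data
        target = os.path.join(out_dir, csv_name(pdf_path, i))
        table.to_csv(target, index=False)
        written.append(target)
    return written


# Example call (hypothetical file name taken from the question):
# extract_all_tables("Sammamish.pdf")
```

Detection quality depends on how the tables are drawn in the PDF, so it is worth checking a few of the resulting CSV files by hand.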

Himanshu Poddar answered Oct 01 '22 at 09:10
