How to extract tables as text from a PDF using Python?

Tags: python, pdf, pdf-parsing

I have a PDF that contains tables, text, and some images. I want to extract the tables wherever they appear in the PDF.

Right now I find each table manually by page. I then capture that page and save it into another PDF.

```python
import PyPDF2

PDFfilename = "Sammamish.pdf"  # filename of your PDF/directory where your PDF is stored
pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb"))  # PdfFileReader object

pg4 = pfr.getPage(126)  # extract pg 127

writer = PyPDF2.PdfFileWriter()  # create PdfFileWriter object
# add pages
writer.addPage(pg4)

NewPDFfilename = "allTables.pdf"  # filename of your PDF/directory where you want your new PDF to be
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream)  # write pages to new PDF
```

My goal is to extract the tables from the whole PDF document.

703

asked Nov 28 '17 14:11

venkat

1 Answer

I would suggest extracting the tables with tabula (the tabula-py package).
Pass the path of your PDF to tabula.read_pdf and it returns the tables as pandas DataFrames.
With multiple_tables = True, each table found in the PDF is returned as its own DataFrame.
The result is a list of DataFrames, so you need pandas to work with them.

This is my code for extracting tables from a PDF.

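```python
import pandas as pd
import tabula

file = "filename.pdf"
path = 'enter your directory path here' + file

# With multiple_tables = True, df is a list holding one DataFrame per table on page 1
df = tabula.read_pdf(path, pages = '1', multiple_tables = True)
print(df)
```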

Please refer to this repo of mine for more details.
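Since the question asks for the tables from the whole document as text, the same call can be sketched for every page. This is only a minimal sketch, assuming tabula-py and the Java runtime it depends on are installed. `read_all_tables` and `tables_to_text` are hypothetical helper names introduced here, not part of tabula, and the tab separator is an arbitrary choice.

```python
import pandas as pd


def read_all_tables(pdf_path):
    """Return every table tabula detects in the PDF as a list of DataFrames."""
    # Imported here so the text helper below can be used without tabula-py/Java.
    import tabula

    # pages="all" scans the whole document instead of a single page.
    return tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)


def tables_to_text(tables, sep="\t"):
    """Render each non-empty DataFrame as delimiter-separated text."""
    chunks = []
    for number, table in enumerate(tables, start=1):
        if table.empty:
            # Skip detections that produced no rows or columns.
            continue
        body = table.to_csv(sep=sep, index=False)
        chunks.append(f"Table {number}\n{body}")
    return "\n".join(chunks)


# Usage (the path is a placeholder, as in the answer above):
# print(tables_to_text(read_all_tables("filename.pdf")))
```

Table detection is heuristic, so it is worth checking the printed output against the PDF, especially for tables that span pages or have merged cells.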

165

answered Oct 01 '22 09:10

Himanshu Poddar
