 

Working on tables in pdf using python

I am working with a PDF file that contains a number of tables.
I want to use Python to fetch the data from a table, selected by the table name given in the PDF.

I have worked with HTML and XML parsing, but never with PDF.
Can anyone tell me how to fetch tables from a PDF using Python?

sam asked Nov 30 '22 15:11



2 Answers

I think you need a Python library for parsing PDFs. The best known is PDFMiner.

According to the documentation:

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
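
As a rough sketch of what "the exact location of text in a page" can look like in practice, the snippet below lists text boxes with their bounding boxes and then searches them for a table name. It assumes the maintained fork pdfminer.six is installed; `TextBox`, `iter_text_boxes`, and `find_caption` are hypothetical names made up for this illustration, and matching the table name by substring is an assumption about how your PDF labels its tables.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class TextBox(NamedTuple):
    page: int    # 1-based page number
    x0: float    # left edge, in PDF points
    y0: float    # bottom edge (the PDF origin is the bottom-left corner)
    x1: float    # right edge
    y1: float    # top edge
    text: str


def iter_text_boxes(pdf_path: str) -> Iterator[TextBox]:
    """Yield every text box pdfminer.six finds, together with its position."""
    # Imported lazily so find_caption() below works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_no, layout in enumerate(extract_pages(pdf_path), start=1):
        for element in layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                yield TextBox(page_no, x0, y0, x1, y1, element.get_text().strip())


def find_caption(boxes: Iterable[TextBox], table_name: str) -> Optional[TextBox]:
    """Return the first box whose text contains table_name (case-insensitive)."""
    needle = table_name.lower()
    for box in boxes:
        if needle in box.text.lower():
            return box
    return None
```

Something like `find_caption(iter_text_boxes("report.pdf"), "Table 3")` (the file name is a placeholder) would give you the page and coordinates of the caption; the table body is then usually the text positioned just below it on the same page, which you still have to collect yourself.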

Sandro Munda answered Dec 05 '22 02:12



This is a very complex problem and not solvable in general.

The reason is simply that the PDF format is too flexible. Some PDFs contain only bitmaps (you would then have to do your own OCR, which is not our topic here). In others the letters are literally spilled out over the pages: when you parse the text information in the PDF, you may get single characters placed at individual coordinates. In some cases these arrive in an orderly fashion (line by line, from left to right), but in other cases the order looks almost random; special characters, characters in a different font, etc. can come way out of line.

The only proper approach is to place all characters according to their coordinates on a page model and then use heuristics to find out what the lines are.
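
To make that idea concrete, here is a minimal sketch of such a heuristic. It assumes you already have the characters as `(x, y, character)` tuples (for example from a PDF parser), that y grows upward as in PDF coordinates, and that a fixed vertical tolerance and a fixed horizontal gap are good enough for your documents. The function names and both thresholds are made up for this example and will need tuning on real files.

```python
from typing import List, Tuple

Char = Tuple[float, float, str]  # (x, y, character)


def group_into_lines(chars: List[Char], y_tol: float = 2.0) -> List[List[Char]]:
    """Cluster characters whose y lies within y_tol of the first character of a line."""
    lines: List[List[Char]] = []
    # Visit characters from the top of the page down (largest y first).
    for ch in sorted(chars, key=lambda c: (-c[1], c[0])):
        if lines and abs(lines[-1][0][1] - ch[1]) <= y_tol:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Within each line, restore left-to-right reading order.
    return [sorted(line, key=lambda c: c[0]) for line in lines]


def split_into_cells(line: List[Char], gap: float = 10.0) -> List[str]:
    """Start a new cell wherever the distance to the previous character exceeds gap."""
    cells: List[str] = []
    prev_x = None
    for x, _, text in line:
        if prev_x is None or x - prev_x > gap:
            cells.append(text)
        else:
            cells[-1] += text
        prev_x = x
    return cells


def chars_to_rows(chars: List[Char], y_tol: float = 2.0, gap: float = 10.0) -> List[List[str]]:
    """Turn unordered positioned characters into rows of cell strings."""
    return [split_into_cells(line, gap) for line in group_into_lines(chars, y_tol)]


# Characters deliberately given out of order, with slightly uneven baselines.
sample = [
    (100.0, 700.0, "1"), (10.0, 700.5, "a"), (16.0, 700.0, "b"),
    (106.0, 699.5, "2"), (10.0, 680.0, "c"), (100.0, 680.0, "3"),
]
print(chars_to_rows(sample))  # [['ab', '12'], ['c', '3']]
```

This only compares left edges and ignores character widths, rotated text, and cells that wrap over several lines, so treat it as a starting point rather than a general solution.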

I suggest looking at your PDFs and the tables you want to parse before you start. Maybe they always share the same layout and are easy to parse.

Good luck!

Alfe answered Dec 05 '22 02:12
