I have a stack of PDFs - potentially hundreds or thousands. They are not all formatted the same, but any of them MAY have one or more tables with interesting information that I would like to collect into a separate database. Of course, I know I have to write something to do this. Perl is an option for me - or perhaps Java. I don't really care what language so long as it's free (or cheap with a free trial period to ensure it suits my purposes). I'm looking at CAM::Parse (using strawberry Perl), but I'm not sure how to use it to locate and extract tables from the files. I guess I do have a preference for Perl, but really I want something that works dependably and is reasonably easy to do string manipulations with. What is a good approach for something like this? I'm at square one, so if java (or python etc.) have better hooks, now is a good time to know about it. General pointers good; starter code would be strongly preferred.

<ol> <li>The PDF format from its inception (more than 20 years ago) never was intended to be host of extractable, meaningfully structured data. </li> <li>Its purpose was to be a reliable visual representation of text, images and diagrams in a document -- a kind of digital paper (that would also reliably be transferred to real paper via printing). Only later in its development more features were added, which should help in extracting data again (google for Tagged PDF).</li> <li> For some examples of problems which are posed when data scraping tables from PDFs, see this article: <ul> <li> Why Updating Dollars for Docs Was So Difficult </li> </ul> </li> <li> Contradicting my point '1.' above, now I say this: for an amazing family of tools that gets better and better from week to week for extracting tabular data from PDFs (unless they are scanned pages), see these links: <ul> <li>Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!</li> <li>Tabula-Extractor: A Command Line Interface to Tabula</li> <li>Tabula source code repository</li> <li>Tabula API (upcoming, not ready yet)</li> </ul> </li> </ol> So: go look for Tabula. If any tools can do what you want, at this time Tabula is probably amongst the best for the job! <hr> <h3>Update</h3> I've recently created an ASCiinema screencast demonstrating the use of the Tabula command line interface to extract a big table from a PDF as CSV: <img src="https://asciinema.org/a/22761.png" alt="asciicast"> (Click on image above to see it running. If it runs too fast for you to read all text, make use of the "Pause" button (<code>||</code>-symbol).) It is hosted here: <ul> <li> https://asciinema.org/a/22761 </li> </ul>

Extracting table contents from a collection of PDF files [closed]

Tags:

parsing

pdf

extract

pdf-parsing

I have a stack of PDFs - potentially hundreds or thousands. They are not all formatted the same, but any of them MAY have one or more tables with interesting information that I would like to collect into a separate database.

Of course, I know I have to write something to do this. Perl is an option for me - or perhaps Java. I don't really care what language so long as it's free (or cheap with a free trial period to ensure it suits my purposes).

I'm looking at CAM::Parse (using strawberry Perl), but I'm not sure how to use it to locate and extract tables from the files. I guess I do have a preference for Perl, but really I want something that works dependably and is reasonably easy to do string manipulations with.

What is a good approach for something like this? I'm at square one, so if java (or python etc.) have better hooks, now is a good time to know about it. General pointers good; starter code would be strongly preferred.

794

asked Jun 20 '13 15:06

elbillaf

1 Answers

The PDF format from its inception (more than 20 years ago) never was intended to be host of extractable, meaningfully structured data.
Its purpose was to be a reliable visual representation of text, images and diagrams in a document -- a kind of digital paper (that would also reliably be transferred to real paper via printing). Only later in its development more features were added, which should help in extracting data again (google for Tagged PDF).
For some examples of problems which are posed when data scraping tables from PDFs, see this article:
- Why Updating Dollars for Docs Was So Difficult
Contradicting my point '1.' above, now I say this: for an amazing family of tools that gets better and better from week to week for extracting tabular data from PDFs (unless they are scanned pages), see these links:
- Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
- Tabula-Extractor: A Command Line Interface to Tabula
- Tabula source code repository
- Tabula API (upcoming, not ready yet)

So: go look for Tabula. If any tools can do what you want, at this time Tabula is probably amongst the best for the job!

Update

I've recently created an ASCiinema screencast demonstrating the use of the Tabula command line interface to extract a big table from a PDF as CSV:

asciicast

(Click on image above to see it running. If it runs too fast for you to read all text, make use of the "Pause" button (||-symbol).)

It is hosted here:

https://asciinema.org/a/22761

149

answered Oct 04 '22 07:10

Kurt Pfeifle

Related questions
                            
                                Code Golf: Mathematical expression evaluator (that respects PEMDAS)
                            
                                Parse Delimited CSV in .NET
                            
                                Any idea how to avoid this assertion in DDTokenCache and what it means?
                            
                                Learning incremental compilation design
                            
                                Scala parser combinators vs ANTLR/Java generated parser?
                            
                                What is the advantage of using a parser generator like happy as opposed to using parser combinators?
                            
                                How can I obtain the named arguments from a console application in the form of a Dictionary<string,string>?
                            
                                Best way to process large XML in PHP [duplicate]
                            
                                ffmpeg Progress Bar - Encoding Percentage in PHP
                            
                                How do you parse a filename in bash?
                            
                                Parse JSON String to JSON Object in C#.NET
                            
                                HtmlAgility - Save parsing to a string
                            
                                JavaScript can't convert Hindi/Arabic numbers to real numeric variables
                            
                                How can I parse dates and convert time zones in Perl?
                            
                                XPath to Parse "SRC" from IMG tag?
                            
                                Parsing hostname and port from string or url
                            
                                Parse URL in shell script
                            
                                Why is bottom-up parsing more common than top-down parsing?
                            
                                C# generic string parse to any object
                            
                                Read/Parse Binary files with Powershell

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With