How to extract data from a PDF file while keeping track of its structure?

Tags: parsing, pdf, extraction

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.

I have tried a few different things, but I did not get very far with any of them:

Convert PDF to text. It does not work for me as I lose images and the structure of the document.
Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation-wise, but I haven't been able to parse the HTML successfully.
Convert PDF to XML. Same as above.

Does anyone have any suggestions on how to tackle this problem?

464

asked Jun 02 '09 03:06

Marcel

4 Answers

There is essentially no easy cut-and-paste solution, because PDF isn't really concerned with structure. There are many other answers on this site that go into much more detail, but this one should give you the main points:

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

If you want to do this in PDF itself (where you would have the most control over the process), you'll have to loop over all text on the pages and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, etc.).
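As a rough illustration of the font-size idea, here is a minimal sketch in Python. It assumes you have already pulled text fragments and their font sizes out of the PDF with some library; the `Fragment` class, the `find_headings` helper and the `ratio` threshold are made up for this example and are not part of any PDF library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Fragment:
    """A hypothetical text fragment as a PDF library might report it."""
    text: str
    font_size: float


def find_headings(fragments, ratio=1.2):
    """Return fragments whose font size is clearly larger than the body size."""
    # Weight each size by the amount of text set in it, so the body font
    # wins over short headings and captions.
    weight = Counter()
    for frag in fragments:
        weight[frag.font_size] += len(frag.text)
    body_size = weight.most_common(1)[0][0]
    return [f for f in fragments if f.font_size >= body_size * ratio]


fragments = [
    Fragment("1. Introduction", 18.0),
    Fragment("PDF stores glyphs and positions, not document structure.", 11.0),
    Fragment("Structure therefore has to be inferred from appearance.", 11.0),
    Fragment("1.1 Scope", 14.0),
    Fragment("Only headings and paragraphs are considered here.", 11.0),
]
headings = find_headings(fragments)
print([h.text for h in headings])
```

In practice you would probably also look at the font name or weight (bold), since some documents set headings in the body size.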

On top of that, you'll also have to identify paragraphs by looking at the positioning of text fragments, the white space on the page, and the proximity of letters, words and lines. PDF by itself doesn't even have a concept of a "word", let alone "lines" or "paragraphs".
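To make the "no concept of a word" point concrete, here is a small sketch that rebuilds words from positioned characters on a single line. The `Glyph` class and the `gap_ratio` threshold are assumptions for the example; a real extractor would get positions and widths from the PDF's font metrics.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A hypothetical positioned character: left edge x and advance width."""
    char: str
    x: float
    width: float


def glyphs_to_words(glyphs, gap_ratio=0.3):
    """Split one line of glyphs into words wherever the horizontal gap is large."""
    words, current, prev = [], "", None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            # A gap wider than a fraction of the previous glyph's width
            # is treated as a word break.
            if gap > gap_ratio * prev.width:
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


# "PDF text" drawn without any space character: only positions separate the words.
line = [
    Glyph("P", 0, 6), Glyph("D", 6, 6), Glyph("F", 12, 6),
    Glyph("t", 22, 4), Glyph("e", 26, 5), Glyph("x", 31, 5), Glyph("t", 36, 4),
]
print(glyphs_to_words(line))
```

The threshold is a heuristic: justified text or heavy kerning can make gaps inside and between words look very similar.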

To complicate things even more, the order in which text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't have to be the proper reading order (or what we humans would consider to be the proper reading order).
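One way to recover a plausible reading order is to ignore the drawing order and sort by position instead. The sketch below assumes a fixed two-column layout and top-left page coordinates; the `Block` class and `reading_order` function are hypothetical, and a real document would need actual column detection.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical text block; left/top are measured from the page's top-left corner."""
    text: str
    left: float
    top: float


def reading_order(blocks, page_width, columns=2):
    """Sort blocks column by column, then top to bottom within each column."""
    column_width = page_width / columns

    def key(block):
        # Clamp so that a block at the far right edge still lands in the last column.
        column = min(int(block.left // column_width), columns - 1)
        return (column, block.top, block.left)

    return sorted(blocks, key=key)


# Blocks listed in a plausible drawing order, which is not the reading order.
drawn = [
    Block("right column, second", 320, 200),
    Block("left column, first", 40, 100),
    Block("right column, first", 320, 100),
    Block("left column, second", 40, 200),
]
ordered = reading_order(drawn, page_width=600)
print([b.text for b in ordered])
```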

188

answered Oct 05 '22 23:10

David van Driessche

You can use the following approach with iTextSharp or other open source libraries:

Read the PDF file with iTextSharp or a similar open source tool and collect all text objects into an array (or convert the PDF to HTML using a tool like pdftohtml and then parse the HTML)
Sort all text objects by their coordinates so that neighbouring objects end up next to each other
Then iterate through the objects and check the distance between them to see whether two or more objects can be merged into one paragraph
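The sort-and-merge steps could look roughly like this in Python. This is only a sketch: it assumes the text objects have already been collected as lines with a top coordinate and a height (measured from the top of the page), and the `Line` class and the `max_gap_ratio` threshold are invented for the example rather than taken from iTextSharp.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """A hypothetical text line; top and height use top-down page coordinates."""
    text: str
    top: float
    height: float


def merge_into_paragraphs(lines, max_gap_ratio=0.6):
    """Sort lines top to bottom and merge neighbours separated by a small gap."""
    paragraphs = []
    prev = None
    for line in sorted(lines, key=lambda l: l.top):
        if prev is not None:
            gap = line.top - (prev.top + prev.height)
            # A small vertical gap means the line continues the current paragraph.
            if gap <= max_gap_ratio * prev.height:
                paragraphs[-1] += " " + line.text
                prev = line
                continue
        paragraphs.append(line.text)
        prev = line
    return paragraphs


lines = [
    Line("is inferred from positions.", 114, 12),
    Line("Paragraph structure", 100, 12),
    Line("A new paragraph starts", 150, 12),
    Line("after a larger gap.", 164, 12),
]
print(merge_into_paragraphs(lines))
```

This only handles a single column; for multi-column pages the objects would first have to be split by column.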

Or you can use a commercial tool like ByteScout PDF Extractor SDK, which is capable of doing exactly this:

extract text and images while analyzing the layout of the text
export to XML or CSV, where text objects are merged or split into paragraphs inside a virtual layout grid
access objects via a special API that lets you address each object by its "virtual" row and column index, regardless of how it is stored inside the original PDF

Disclaimer: I am affiliated with ByteScout

answered Oct 05 '22 21:10

Eugene

Parsing a PDF for headings and their sub-content is very difficult (though not impossible), as PDFs come in many different forms. I recently came across a tool named GROBID that can help in this scenario. I know it's not perfect, but with proper training it can accomplish our goals.

GROBID is available as open source on GitHub:

https://github.com/kermitt2/grobid
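GROBID returns its full-text result as TEI XML, so once you have that output, pulling out headings and paragraphs is ordinary XML parsing. The sketch below uses only the Python standard library; the sample document is hand-written for illustration (real GROBID output contains far more markup), and the `sections` helper is a name made up for this example.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only, not real GROBID output.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Introduction</head><p>First paragraph.</p><p>Second paragraph.</p></div>
    <div><head>Method</head><p>Only paragraph.</p></div>
  </body></text>
</TEI>"""


def sections(tei_xml):
    """Return (heading, [paragraphs]) pairs from TEI-style XML."""
    root = ET.fromstring(tei_xml)
    result = []
    for div in root.iterfind(".//tei:body/tei:div", TEI):
        head = div.find("tei:head", TEI)
        # itertext() flattens inline markup such as references inside the text.
        title = "".join(head.itertext()).strip() if head is not None else ""
        paras = ["".join(p.itertext()).strip() for p in div.findall("tei:p", TEI)]
        result.append((title, paras))
    return result


for title, paras in sections(SAMPLE):
    print(title, len(paras))
```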

answered Oct 05 '22 22:10

Vaibhav Panmand

PDF files can be parsed with tabula-py, or tabula-java.

I wrote a full tutorial on how to use tabula-py in this article. You can also use Tabula in a web browser, as long as you have Java installed.

answered Oct 05 '22 22:10

Eric Kim
