
Haskell: parsing PDF

Tags: pdf, haskell

I need to read a PDF, apply some transformations (generate TOC bookmarks), and write it back.

I found HPDF (http://hackage.haskell.org/package/HPDF), but its documentation only mentions generating PDFs, not parsing them (although I could have missed it).

I chose Haskell purely for (self-)educational purposes.

Asked by artemave, Mar 05 '10 18:03



2 Answers

Check out the pdf-toolbox library. Its support for generating PDF files is low-level, but powerful enough for your task.

Here is an example of how to change the title of an existing PDF file using the incremental update feature.
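For background on what an incremental update relies on: in the PDF file format, an update is appended to the end of the file, and a reader finds the newest cross-reference section through the last `startxref` keyword, which is followed by a byte offset. The sketch below illustrates only that lookup. It does not use the pdf-toolbox API; it assumes the `bytestring` package, and the names `hasPdfHeader` and `lastStartXref` are hypothetical helpers made up for this illustration.

```haskell
import qualified Data.ByteString.Char8 as BS
import Data.Char (isSpace)

-- | True when the bytes begin with the "%PDF-" marker expected
-- at the start of a PDF file. (Hypothetical helper name.)
hasPdfHeader :: BS.ByteString -> Bool
hasPdfHeader = BS.isPrefixOf (BS.pack "%PDF-")

-- | Byte offset written after the LAST "startxref" keyword, if any.
-- Each incremental update appends its own "startxref", so the last
-- one points at the most recent cross-reference section.
-- (Hypothetical helper name; not part of any PDF library.)
lastStartXref :: BS.ByteString -> Maybe Int
lastStartXref = go Nothing
  where
    marker = BS.pack "startxref"

    -- Scan forward, remembering the most recent offset parsed so far.
    go acc bs =
      case BS.breakSubstring marker bs of
        (_, rest)
          | BS.null rest -> acc            -- no further marker found
          | otherwise ->
              let afterMarker = BS.drop (BS.length marker) rest
                  -- Skip whitespace, then read the decimal offset.
                  offset = fst <$> BS.readInt (BS.dropWhile isSpace afterMarker)
              in go (maybe acc Just offset) afterMarker
```

Loaded in GHCi, `lastStartXref` applied to bytes containing two `startxref` entries (say, offsets 116 and 532) returns the later one, `Just 532`. A real tool would go on to parse the cross-reference table and trailer at that offset, which is the part a library such as pdf-toolbox is meant to handle.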

Answered by Yuras, Oct 03 '22 19:10


There are a few tools for PDF manipulation, though they seem biased toward generation rather than parsing:

  • http://johnmacfarlane.net/pandoc/

Pandoc is a great cross-markup library, but doesn't support PDF parsing (it does support PDF generation from a variety of formats).

There's also:

  • http://hackage.haskell.org/package/HsHaruPDF
  • http://hackage.haskell.org/package/pdf2line -- a tool for extracting text from PDFs
  • http://hackage.haskell.org/package/HPDF -- another PDF generation library

I'm not sure we have a good parsing tool yet.

Answered by Don Stewart, Oct 03 '22 19:10