What I need is to read pdf, make some transformations (generate TOC bookmarks) and write it back.
I found this http://hackage.haskell.org/package/HPDF , but it only mentions generating pdf, not the parsing (although I could have missed it)
Haskell is chosen purely for (self)educational purposes.
So, how does PDF parser work? A PDF parser goes down to the foundational blocks of a PDF document and uses an algorithm to identify the types of data included in the document. A well-trained PDF parser will be able to identify all basic types of document elements.
A PDF parser, or PDF scraper, is a tool that extracts data from PDF documents. Document parsing is a popular approach to extract text, images or data from inaccessible formats such as PDFs.
In a functional language such as Haskell, parsers can naturally be viewed as functions. type Parser = String Tree. A parser is a function that takes a string and returns some form of tree.
A Parser combinator, as wikipedia describes it, is a higher-order function that accepts several parsers as input and returns a new parser as its output. They can be very powerful when you want to build modular parsers and leave them open for further extension.
Checkout pdf-toolbox library. It's support for PDF file generating is low level, but powerful enough for your task.
Here is an example how to change title of an existing PDF file using incremental update feature.
There are a few tools for PDF manipulation, though they seem to bias towards generation, rather than parsing:
Pandoc is a great cross-markup library, but doesn't support PDF parsing (it does support PDF generation from a variety of formats).
There's also:
I'm not sure we have a good parsing tool yet.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With