
Extract everything from PDF [closed]

I am looking for a solution to extract content from a PDF file, either a console tool or a library.

It will be used on a server to produce online e-books from uploaded PDF files.

I need to extract the following:

  1. text with fonts and styles;
  2. images;
  3. audio and video;
  4. links and hotspots;
  5. page snapshots and thumbnails;
  6. general PDF information, e.g. book layouts, number of pages etc.

We are looking at Adobe PDF Library ($5000, though), BCL SDK (?), PDFLib (€795) and QuickPDF ($250).

Currently we use the open source pdf2xml (extracts text, images and links) and GhostScript (snapshots and thumbnails). What is still missing:

  1. fonts;
  2. multimedia;
  3. hotspots;
  4. page info.

We are hesitating between paying a lot of money (and possibly making a mistake by choosing the wrong product) and using free/open source solutions.

Which solution would you recommend as the best for extracting nearly everything from a PDF?

Any comments will be much appreciated.

Maksym asked Nov 12 '09 11:11





1 Answer

It sounds like, with a few days or weeks of effort, you could adapt the open source tools to your needs. Fonts and everything else can certainly be extracted; every PDF reader must do this anyway to display them.

You should probably estimate your programmer cost ($/hr) and multiply it by the estimated time it would take to add the missing functionality to the open source tools (60-80 hours?). If the result is greater than or close to $5000, you might consider just buying the commercial software.

Otherwise, with the help of the (quite good) PDF reference, you should be well on your way.
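The build-or-buy estimate above can be sketched in a few lines of Python. This is only an illustration: the $5000 figure and the 60-80 hour range come from this thread, while the hourly rates and the function names `build_cost` and `break_even_rate` are made up for the example.

```python
# Build-or-buy break-even check. The hourly rates are hypothetical examples.
LICENSE_PRICE = 5000  # USD, the Adobe PDF Library price quoted in the question


def build_cost(hourly_rate: float, hours: float) -> float:
    """Cost of extending the open source tools in-house."""
    return hourly_rate * hours


def break_even_rate(hours: float, license_price: float = LICENSE_PRICE) -> float:
    """Hourly rate at which in-house work costs as much as the license."""
    return license_price / hours


if __name__ == "__main__":
    for hours in (60, 80):  # the rough effort estimate from the answer
        for rate in (40, 75):  # hypothetical hourly rates in USD
            cost = build_cost(rate, hours)
            verdict = "build" if cost < LICENSE_PRICE else "consider buying"
            print(f"{hours} h at ${rate}/h = ${cost:.0f}: {verdict}")
        print(f"break-even rate for {hours} h: ${break_even_rate(hours):.2f}/h")
```

At the 80-hour end of the estimate, the in-house work matches the license price at $62.50/h. The sketch ignores maintenance costs and the cheaper commercial options, so treat it as a first approximation only.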

One more thing: you might find Poppler helpful. It is a PDF rendering library, but rendering is closely related to what you are trying to do.
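As a rough illustration of how Poppler could cover the missing "fonts" and "page info" items, its command-line utilities `pdfinfo` and `pdffonts` can be called from a script. The sketch below assumes those tools are installed and on the PATH, and that `pdfinfo` prints one `Key: value` pair per line. The helper names and `book.pdf` are hypothetical.

```python
import subprocess


def parse_pdfinfo(output: str) -> dict:
    """Turn pdfinfo's 'Key:   value' lines into a dictionary of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as dates stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdf_info(path: str) -> dict:
    """Run Poppler's pdfinfo on a file and return its parsed metadata."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)


def pdf_fonts(path: str) -> str:
    """Return the raw font table printed by Poppler's pdffonts."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A call such as `pdf_info("book.pdf").get("Pages")` would then return the page count as a string. Multimedia and hotspots are not covered by these two tools and would still need separate handling.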

Adam Goode answered Oct 06 '22 23:10
