Looking for a solution to extract content from a PDF file (using a console tool or a library).
It will be used on a server to produce online e-books from uploaded PDF files.
Need to extract the following things:
Looking at Adobe PDF Library ($5000 though), BCL SDK (?), PDFLib (€795), QuickPDF ($250)
Right now we are using the open source pdf2xml (extracts text, images and links) and Ghostscript (snapshots and thumbnails). The other things left are:
We are hesitating between paying a lot of money (and possibly making a mistake by choosing the wrong solution) and using free/open source solutions.
Which solution would you recommend as the BEST for extracting nearly everything from a PDF?
Any comments will be much appreciated.
If you go to File > Open Recent, a list containing your last 9 files will display and you can choose the one you want to open from there. You can also click "History" and search the files you opened today, yesterday, 14 days ago, etc. Hope this helps.
Sounds like with a few days' or weeks' effort, you can adapt the open source tools to your needs. Fonts and everything else can certainly be extracted; this is something that every PDF reader must do anyway to display them.
You should probably take an estimate of programmer costs ($/hr) and multiply it by the estimated time it would take to add the needed functionality to the open source tools (60-80 hours?). If the result is greater than or close to $5000 anyway, you might consider just buying the commercial software.
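The estimate above can be sketched as a few lines of Python. This is only an illustration: the function name build_vs_buy and the $50/hr rate are made up for the example, and only the 60-80 hour range and the $5000 price come from this thread.

```python
def build_vs_buy(hourly_rate, hours_low, hours_high, license_cost):
    """Compare the cost of extending open source tools against a license price.

    Returns (low_cost, high_cost, verdict). The verdict is "buy" when even the
    optimistic estimate reaches the license cost, "build" when even the
    pessimistic estimate stays below it, and "close call" otherwise.
    """
    low_cost = hourly_rate * hours_low
    high_cost = hourly_rate * hours_high
    if low_cost >= license_cost:
        verdict = "buy"
    elif high_cost < license_cost:
        verdict = "build"
    else:
        verdict = "close call"
    return low_cost, high_cost, verdict


if __name__ == "__main__":
    # Hypothetical rate of $50/hr; 60-80 hours and $5000 are from the thread.
    print(build_vs_buy(50, 60, 80, 5000))
```

Plug in your own hourly rate. A "close call" result is the case where buying the commercial library is probably the safer choice, since effort estimates tend to run over.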
Otherwise, with the help of the (quite good) PDF reference, you should be well on your way.
One more thing: you might find Poppler to be of help. It is a library for rendering PDF, but that is closely related to what you are trying to do.
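As a rough sketch of what that could look like on a server: Poppler ships command-line utilities (pdftotext, pdfimages, pdffonts) that can be called from a script. The helper names poppler_commands and run_all and the output layout below are assumptions for illustration, not something from this thread, and the tools must already be installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def poppler_commands(pdf_path, out_dir):
    """Build the Poppler command lines for one PDF (hypothetical output layout)."""
    out = Path(out_dir)
    return {
        # Plain text, keeping the physical page layout where possible.
        "text": ["pdftotext", "-layout", pdf_path, str(out / "text.txt")],
        # Embedded images, written as PNG files named img-000.png, img-001.png, ...
        "images": ["pdfimages", "-png", pdf_path, str(out / "img")],
        # Table of the fonts used in the document, printed to stdout.
        "fonts": ["pdffonts", pdf_path],
    }


def run_all(pdf_path, out_dir):
    """Run every command and return the captured stdout of each one."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    results = {}
    for name, cmd in poppler_commands(pdf_path, out_dir).items():
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(cmd[0] + " not found; install the Poppler utilities")
        done = subprocess.run(cmd, capture_output=True, text=True, check=True)
        results[name] = done.stdout
    return results
```

Calling run_all("book.pdf", "out") would leave the text and images in the out directory and return the font table as a string. Note that pdffonts only lists fonts; pulling the embedded font programs themselves out of the file would need other tooling.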