Put simply, I need to rip the text out of a large number of PDFs so I can analyse the contents before sticking them in an SQL database.
I've found some pretty sketchy free C# libraries that sort of work (the best one uses iTextSharp), but there are umpteen formatting errors, some characters are scrambled, and a lot of the time there are spaces (' ') EVERYWHERE: inside words, between every letter, and in huge blocks taking up several lines. It all seems a bit random.
Is there any easy way of doing this that I'm completely overlooking (quite likely!) or is it a bit of an arduous task that involves converting the extracted byte values into letters reliably?
Once you've opened the file, click on the "Edit" tab, and then click on the "edit" icon. Now you can right-click on the text and select "Copy" to extract the text you need.
Open a PDF file containing a scanned image in Acrobat for Mac or PC. Click on the “Edit PDF” tool in the right pane. Acrobat automatically applies optical character recognition (OCR) to your document and converts it to a fully editable copy of your PDF. Click the text element you wish to edit and start typing.
Since that's what you're looking for: PoDoFo is a C++ library to parse, read, modify or create PDF files. The library is cross-platform.
There may be some difficulty in doing this reliably. The problem is that PDF is a presentation format which attaches importance to good typography. Suppose you just wanted to output a single word: Tap.
A PDF rendering engine might output this as two separate calls, as shown in this pseudo-code:
moveto (x1, y); output ("T")
moveto (x2, y); output ("ap")
This would be done because the default kerning (inter-letter spacing) between the letters T and a might not be acceptable to the rendering engine, or because it is adding or removing some micro space between characters to get a fully justified line. The end result is that the text fragments actually stored in a PDF are very often not full words, but pieces of them.
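To make the consequence concrete, here is a minimal sketch of how positioned fragments could be stitched back into words. It assumes your extraction library can report each fragment's position and rendered width, which not every library exposes. The names `Fragment`, `join_fragments`, `space_gap` and `line_tol` are made up for this illustration and are not part of iTextSharp or PoDoFo, and the threshold values are guesses you would need to tune per document.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned piece of text (hypothetical structure)."""
    x: float      # left edge of the fragment
    y: float      # baseline; PDF y coordinates grow upwards
    width: float  # rendered width of the fragment
    text: str


def join_fragments(fragments, space_gap=2.0, line_tol=1.0):
    """Rebuild lines of text from positioned fragments.

    Fragments whose baselines differ by at most line_tol are treated as
    one line. Within a line, a space is inserted only when the horizontal
    gap between neighbours exceeds space_gap; smaller (or negative) gaps
    are assumed to be kerning and the pieces are glued together.
    """
    # Group fragments into lines, walking from the top of the page down.
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(frag.y - lines[-1][0].y) <= line_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x)  # left to right within the line
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > space_gap:
                parts.append(" ")
            parts.append(cur.text)
        result.append("".join(parts))
    return "\n".join(result)


if __name__ == "__main__":
    sample = [
        Fragment(15.5, 100, 11, "ap"),    # kerned tightly against the "T"
        Fragment(10, 100, 6, "T"),
        Fragment(30, 100, 25, "water"),   # a real word gap
    ]
    print(join_fragments(sample))
```

With the sample above, "T" and "ap" overlap slightly and are merged, while "water" sits far enough away to get a space. A fixed `space_gap` is the weak point: in practice it probably needs to scale with the font size, and fully justified text will still trip it up now and then.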