What is the easiest way to get the text (words) of a PDF file as one long String or an array of Strings?
I have tried PDFBox, but it is not working for me.
Open a PDF file containing a scanned image in Acrobat for Mac or PC. Click on the “Edit PDF” tool in the right pane. Acrobat automatically applies optical character recognition (OCR) to your document and converts it to a fully editable copy of your PDF. Click the text element you wish to edit and start typing.
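Use iText. The following snippet, for example, extracts the text of page 3. It uses the instance-style PdfTextExtractor from older iText releases; as far as I know, later iText 5 releases expose this as a static call, PdfTextExtractor.getTextFromPage(reader, pageNumber).
PdfReader reader = new PdfReader("C:/Text.pdf");
PdfTextExtractor parser = new PdfTextExtractor(reader);
String text = parser.getTextFromPage(3); // page numbers start at 1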
PDFBox barfs on many newer PDFs, especially those with embedded PNG images.
I was very impressed with PDFTextStream. JPedal and Multivalent also offer text extraction in Java, or you could access xpdf using Runtime.exec.
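For the xpdf route, here is a minimal sketch of what that could look like. It assumes the `pdftotext` tool from xpdf is installed and on the PATH, and it uses `ProcessBuilder` rather than `Runtime.exec` directly, since it makes reading the output a little easier. The class and method names (`XpdfTextExtractor`, `buildCommand`, `extractText`, `splitWords`) are made up for illustration, and it needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: shells out to xpdf's pdftotext and returns the text.
class XpdfTextExtractor {

    // Builds the command line. "-enc UTF-8" selects the output encoding,
    // and "-" as the output file tells pdftotext to write to stdout.
    static List<String> buildCommand(String pdfPath) {
        return List.of("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Runs pdftotext and returns the whole document as one long String.
    static String extractText(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath))
                // Discard stderr so a chatty tool cannot block on a full pipe.
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        String text;
        try (InputStream out = process.getInputStream()) {
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    // Splits the extracted text into words on any run of whitespace.
    static String[] splitWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) throws Exception {
        String text = extractText(args[0]);
        System.out.println(splitWords(text).length + " words extracted");
    }
}
```

Run it with the PDF path as the only argument. `extractText` gives you the one long String, and `splitWords` gives you the array of Strings asked for in the question.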