From PDF to String

Tags: java, text, io, pdf

What is the easiest way to get the text (words) of a PDF file as one long String or an array of Strings?

I have tried PDFBox, but it is not working for me.

Ankur asked Nov 05 '09 04:11



3 Answers

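Use iText. For example, the following snippet extracts the text of page 3 (this is the instance-style PdfTextExtractor API; page numbers start at 1):

PdfReader reader = new PdfReader("C:/Text.pdf");
PdfTextExtractor parser = new PdfTextExtractor(reader);
String text = parser.getTextFromPage(3); // returns the text of page 3 as a String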

Kushal Paudyal answered Nov 02 '22 19:11


PDFBox barfs on many newer PDFs, especially those with embedded PNG images.

I was very impressed with PDFTextStream.

Sam Barnum answered Nov 02 '22 20:11


JPedal and Multivalent also offer text extraction in Java, or you could call xpdf from Java using Runtime.exec.

mark stephens answered Nov 02 '22 20:11
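
For the xpdf route mentioned in the last answer, here is a minimal sketch of what shelling out could look like. It assumes xpdf's pdftotext command-line tool is installed and on the PATH, and it uses ProcessBuilder rather than Runtime.exec because it makes argument handling and error redirection a little easier. The class and method names (XpdfTextExtractor, buildCommand, extractText, splitWords) are made up for illustration and are not part of any library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs the external "pdftotext" tool and returns its output.
final class XpdfTextExtractor {

    // Command line for pdftotext: "-enc UTF-8" selects the output encoding,
    // and "-" as the output file name sends the text to stdout.
    static List<String> buildCommand(String pdfPath) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Returns the whole document as one long String.
    // Assumes pdftotext is installed; throws IOException if it is missing or fails.
    static String extractText(String pdfPath) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(pdfPath));
        // Pass the tool's error messages through to our own stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        // Read all of stdout before waiting, so the child never blocks on a full pipe.
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                collected.write(buffer, 0, read);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }

    // Splits extracted text into words on any run of whitespace.
    static String[] splitWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

Calling XpdfTextExtractor.extractText("C:/Text.pdf") would give the one long String, and passing that result to splitWords gives the array of Strings. Note that this only works for PDFs that contain real text; scanned pages would need OCR first.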