I would like to extract text from a given PDF file with Apache PDFBox.
I wrote this code:
    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    File file = new File(filepath);

    PDFParser parser = new PDFParser(new FileInputStream(file));
    parser.parse();
    cosDoc = parser.getDocument();
    pdfStripper = new PDFTextStripper();
    pdDoc = new PDDocument(cosDoc);
    pdfStripper.setStartPage(1);
    pdfStripper.setEndPage(5);
    String parsedText = pdfStripper.getText(pdDoc);
    System.out.println(parsedText);
However, I got the following error:
    Exception in thread "main" java.lang.NullPointerException
        at org.apache.fontbox.afm.AFMParser.main(AFMParser.java:304)
I added pdfbox-1.8.5.jar and fontbox-1.8.5.jar to the class path.
Edit
I added System.out.println("program starts"); to the beginning of the program.
I ran it and got the same error as above, and "program starts" did not appear in the console.
Thus, I think I have a problem with the class path or something.
Thank you.
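One way to test the class-path suspicion is to ask the JVM directly whether it can see the PDFBox and FontBox classes, and which JAR each one comes from. The following is a minimal sketch that uses only the JDK; the class name ClasspathCheck and the helper locate are made up for illustration, and the two class names checked are simply the ones that appear in the code and stack trace above. Run it with the same class path as the failing program.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of PDFBox or FontBox.
final class ClasspathCheck {

    /**
     * Returns the location a class was loaded from, or null if the class
     * is not visible on the current class path.
     */
    static String locate(String className) {
        try {
            // initialize=false: resolve the class without running its static initializers
            Class<?> cls = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK typically have no code source
            if (source == null || source.getLocation() == null) {
                return "(JDK runtime)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.pdfbox.pdmodel.PDDocument", // expected in pdfbox-1.8.5.jar
            "org.apache.fontbox.afm.AFMParser"      // expected in fontbox-1.8.5.jar
        };
        for (String name : names) {
            String where = locate(name);
            System.out.println(name + " -> " + (where == null ? "NOT FOUND" : where));
        }
    }
}
```

If both classes are found, the JARs themselves are probably fine. In that case, note that the only frame in the trace is AFMParser.main, which suggests the JVM may be starting FontBox's AFMParser as the main class instead of your own. That would be consistent with "program starts" never printing, but it is only a reading of the trace, so it is worth checking which main class your run configuration or manifest names.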
One major difference is that PDFBox always processes text glyph by glyph while iText normally processes it chunk (i.e. single string parameter of text drawing operation) by chunk; that reduces the required resources in iText quite a lot.
Using PDFBox 2.0.7, this is how I get the text of a PDF:
    static String getText(File pdfFile) throws IOException {
        // try-with-resources closes the document once the text has been read
        try (PDDocument doc = PDDocument.load(pdfFile)) {
            return new PDFTextStripper().getText(doc);
        }
    }
Call it like this:
    try {
        String text = getText(new File("/home/me/test.pdf"));
        System.out.println("Text in PDF: " + text);
    } catch (IOException e) {
        e.printStackTrace();
    }
Since user oivemaria asked in the comments:
You can use PDFBox in your application by adding it to your dependencies in build.gradle:
    dependencies {
        // 'compile' was removed in Gradle 7; use 'implementation' there instead
        compile group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
    }
Here's more on dependency management using Gradle.
If you want to keep the PDF's format in the parsed text, give PDFLayoutTextStripper a try.