How to extract text from a PDF file with Apache PDFBox

Tags: java, pdfbox

I would like to extract text from a given PDF file with Apache PDFBox.

I wrote this code:

    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    File file = new File(filepath);

    PDFParser parser = new PDFParser(new FileInputStream(file));
    parser.parse();
    cosDoc = parser.getDocument();
    pdfStripper = new PDFTextStripper();
    pdDoc = new PDDocument(cosDoc);
    pdfStripper.setStartPage(1);
    pdfStripper.setEndPage(5);
    String parsedText = pdfStripper.getText(pdDoc);
    System.out.println(parsedText);

However, I got the following error:

    Exception in thread "main" java.lang.NullPointerException
        at org.apache.fontbox.afm.AFMParser.main(AFMParser.java:304)

I added pdfbox-1.8.5.jar and fontbox-1.8.5.jar to the class path.

Edit

I added System.out.println("program starts"); to the beginning of the program.

I ran it and got the same error as above, and "program starts" did not appear in the console.

So I think the problem is with the class path or something similar.

Thank you.

asked May 22 '14 17:05 by Benben

1 Answer

Using PDFBox 2.0.7, this is how I get the text of a PDF:

    static String getText(File pdfFile) throws IOException {
        // try-with-resources closes the document once the text has been read
        try (PDDocument doc = PDDocument.load(pdfFile)) {
            return new PDFTextStripper().getText(doc);
        }
    }

Call it like this:

    try {
        String text = getText(new File("/home/me/test.pdf"));
        System.out.println("Text in PDF: " + text);
    } catch (IOException e) {
        e.printStackTrace();
    }

Since user oivemaria asked in the comments:

You can use PDFBox in your application by adding it to your dependencies in build.gradle:

    dependencies {
        // 'compile' was removed in Gradle 7; 'implementation' is its replacement
        implementation group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
    }

Here's more on dependency management using Gradle.
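
The question ends on a suspected class path problem. As a minimal sketch for checking that, using only the JDK, the class below reports whether the PDFBox and FontBox classes can be loaded and from which jar. ClasspathCheck and its two helper methods are my own hypothetical names, not part of PDFBox. The package org.apache.pdfbox.text is where PDFTextStripper lives in 2.0.x (in 1.8.x it is org.apache.pdfbox.util).

```java
import java.security.CodeSource;

// Hypothetical helper, not part of PDFBox: reports whether classes are loadable.
final class ClasspathCheck {

    /** Returns true if the named class can be found by this class's loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Describes where a class was loaded from (jar or directory), if known. */
    static String locationOf(String className) {
        try {
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK usually have no code source
            return src == null ? "(JDK / bootstrap class)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not found)";
        }
    }

    public static void main(String[] args) {
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        String[] names = {
            "org.apache.pdfbox.pdmodel.PDDocument",
            "org.apache.pdfbox.text.PDFTextStripper", // 2.0.x package
            "org.apache.fontbox.afm.AFMParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locationOf(name));
        }
    }
}
```

Run it with the same class path as the real program. A "(not found)" line points to a missing jar. Note also that the only frame in the question's stack trace is AFMParser.main, which suggests the JVM was started with that class as its entry point instead of the asker's own class. That is my reading of the trace, not something confirmed in this thread.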


If you want to keep the PDF's format in the parsed text, give PDFLayoutTextStripper a try.

answered Oct 05 '22 13:10 by Matthias Braun