How to extract text from a PDF file with Apache PDFBox

Tags: java, pdfbox

I would like to extract text from a given PDF file with Apache PDFBox.

I wrote this code:

PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(filepath);

PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);

However, I got the following error:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.fontbox.afm.AFMParser.main(AFMParser.java:304)

I added pdfbox-1.8.5.jar and fontbox-1.8.5.jar to the class path.

Edit

I added System.out.println("program starts"); to the beginning of the program.

I ran it and got the same error as above; "program starts" did not appear in the console.

Thus, I think I have a problem with the class path or something.

Thank you.

910

asked May 22 '14 17:05

Benben

1 Answer

Using PDFBox 2.0.7, this is how I get the text of a PDF:

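// PDFBox 2.x: org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper
static String getText(File pdfFile) throws IOException {
    // try-with-resources closes the document even if text extraction fails
    try (PDDocument doc = PDDocument.load(pdfFile)) {
        return new PDFTextStripper().getText(doc);
    }
}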

Call it like this:

try {
    String text = getText(new File("/home/me/test.pdf"));
    System.out.println("Text in PDF: " + text);
} catch (IOException e) {
    e.printStackTrace();
}

Since user oivemaria asked in the comments:

You can use PDFBox in your application by adding it to your dependencies in build.gradle:

dependencies {
    // 'implementation' replaces the 'compile' configuration, which Gradle 7 removed
    implementation group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
}

The Gradle documentation has more on dependency management.

If you want to keep the PDF's format in the parsed text, give PDFLayoutTextStripper a try.
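
As for the NullPointerException in the question: its only stack frame is AFMParser.main, and "program starts" never printed. That suggests the JVM launched FontBox's own main method instead of yours, for example through a run configuration that points at the wrong main class. Below is a minimal sketch for checking this with nothing but the JDK. The class name ClasspathCheck and its two helper methods are made up for illustration; they are not part of PDFBox.

```java
import java.net.URL;
import java.security.CodeSource;

public class ClasspathCheck {

    /** Returns true if the named class can be loaded from the current classpath. */
    public static boolean isOnClasspath(String className) {
        try {
            // 'false' = load the class without running its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the jar or directory a class was loaded from, or null if unknown (e.g. JDK classes). */
    public static URL locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? null : source.getLocation();
    }

    public static void main(String[] args) {
        // If this line does not show up, the JVM is not starting this class at all.
        System.out.println("program starts");
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        System.out.println("this class loaded from: " + locationOf(ClasspathCheck.class));

        String[] expected = {
            "org.apache.pdfbox.pdmodel.PDDocument",
            "org.apache.fontbox.afm.AFMParser"
        };
        for (String name : expected) {
            System.out.println(name + " found: " + isOnClasspath(name));
        }
    }
}
```

If both library classes are reported as found but the output of your own program never appears, check which main class your IDE run configuration or `java` command actually starts.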

158

answered Oct 05 '22 13:10

Matthias Braun
