Performance iText vs. PdfBox

Tags: java, performance, pdfbox, itext

I'm trying to convert a PDF (my favorite book, Effective Java, if it matters) to text. I checked both iText and Apache PDFBox, and I see a really big difference in performance: with iText it took 2:521, and with PDFBox 6:117. This is my code for PDFBox:

PDFTextStripper stripper = new PDFTextStripper();
BUFFER.append(stripper.getText(PDDocument.load(pdf)));

And this is for iText:

PdfReader reader = new PdfReader(pdf);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
  BUFFER.append(PdfTextExtractor.getTextFromPage(reader, i));
}

My question is: what does the performance depend on? Is there a way to make PDFBox faster, or is the only option to use iText? And can you explain more about how strategies affect performance?

asked Mar 12 '14 02:03

meilechh

1 Answer

My question is: what does the performance depend on? Is there a way to make PDFBox faster?

One major difference is that PDFBox always processes text glyph by glyph, while iText normally processes it chunk by chunk (a chunk being the single string parameter of a text-drawing operation); that reduces the resources iText requires quite a lot. Furthermore, the event-oriented architecture of iText's text parsing places a lower burden on resources than PDFBox's architecture does. And PDFBox keeps information that is not strictly required for plain text extraction available for a longer time, which costs more resources.
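To make the glyph-versus-chunk difference concrete, here is a toy model in plain Java. It is not code from either library: the class and method names are made up for illustration, and each `char` is assumed to stand for one glyph. It only counts how many processing events each approach would trigger for the same text-drawing operands.

```java
import java.util.Arrays;
import java.util.List;

public class GlyphVsChunkSketch {
    /** Chunk-wise: one event per non-empty string operand of a text-drawing operation. */
    static int eventsPerChunk(List<String> shownStrings) {
        int events = 0;
        for (String s : shownStrings) {
            if (!s.isEmpty()) {
                events++; // the whole string is handled in a single callback
            }
        }
        return events;
    }

    /** Glyph-wise: one event per glyph (modelled here as one per char). */
    static int eventsPerGlyph(List<String> shownStrings) {
        int events = 0;
        for (String s : shownStrings) {
            events += s.length(); // every glyph gets its own callback
        }
        return events;
    }

    public static void main(String[] args) {
        // Hypothetical string operands as they might appear on one page.
        List<String> page = Arrays.asList("Effective", " Java", "Second Edition");
        System.out.println("chunk events: " + eventsPerChunk(page));
        System.out.println("glyph events: " + eventsPerGlyph(page));
    }
}
```

If each event also allocates a position object, the glyph-wise count is roughly the number of objects created per page, which is where the extra resource use comes from in this simplified picture.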

But the way the libraries initially load the document may also make a difference. Here you can experiment a bit: PDFBox offers not only multiple PDDocument.load overloads but also some PDDocument.loadNonSeq overloads (actually, PDDocument.loadNonSeq reads documents correctly, while PDDocument.load can be tricked into misinterpreting PDFs). All these variants may have different runtime behavior.
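For such experiments, a small timing helper makes the comparison less dependent on JIT warm-up and one-off outliers than a single measured run. The following is a minimal sketch using only the JDK; `ExtractionTimer` and `medianMillis` are names chosen for this example, and the workload in `main` is a stand-in, not a real PDF extraction.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;

public class ExtractionTimer {
    /**
     * Calls the task `warmups` times unmeasured, then `runs` times measured,
     * and returns the median measured time in milliseconds
     * (the upper middle value when `runs` is even).
     */
    static double medianMillis(Callable<String> task, int warmups, int runs) throws Exception {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.call(); // give the JIT a chance to compile hot paths first
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        return nanos[runs / 2] / 1_000_000.0;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload; replace it with a task that loads the PDF and extracts its text.
        Callable<String> dummy = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
            return sb.toString();
        };
        System.out.println("median ms: " + medianMillis(dummy, 3, 5));
    }
}
```

To compare the loading variants, one would pass one task per PDDocument.load or PDDocument.loadNonSeq overload, each loading the file, running the stripper, and closing the document, and a further task for the iText loop from the question.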

more about how strategies affect performance?

iText comes with a simple and a more advanced text extraction strategy. The simple one assumes that text in the page content stream appears in reading order, while the more advanced one sorts. By default the more advanced one is used, so you can probably speed up iText even more by using the simple strategy. PDFBox always sorts.
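The difference between the two kinds of strategy can be sketched with a toy model in plain Java. This is not iText's implementation: `Chunk`, `inStreamOrder`, and `sortedByLocation` are hypothetical names, and the sort key (top to bottom, then left to right) is a simplification. It shows why the sorting variant has to hold all chunks of a page and pay for a sort, and why the simple variant only gives correct text when the content stream is already in reading order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class StrategySketch {
    /** A text chunk with its start position on the page (y grows upwards, as in PDF). */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Simple approach: trust the content stream order and join the chunks as they arrive. */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Sorting approach: collect all chunks, order them top to bottom and left to right, then join. */
    static String sortedByLocation(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        return inStreamOrder(sorted);
    }

    public static void main(String[] args) {
        // Chunks listed in a content stream order that differs from the reading order.
        List<Chunk> page = Arrays.asList(
                new Chunk("World", 100f, 700f),
                new Chunk("Hello", 50f, 700f),
                new Chunk("Footer", 50f, 100f));
        System.out.println(inStreamOrder(page));
        System.out.println(sortedByLocation(page));
    }
}
```

In iText 5 itself, the strategy is chosen through the three-argument overload, for example `PdfTextExtractor.getTextFromPage(reader, i, new SimpleTextExtractionStrategy())`; whether the simple strategy returns the text in a usable order depends on how the PDF was produced.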

answered Sep 30 '22 12:09

mkl
