Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance iText vs.PdfBox

I'm trying to convert a pdf (my favorite book Effective Java, if its matter)to text, i checked both iText and Apache PdfBox. I see a really big difference in performance: With iText it took 2:521, and with PdfBox: 6:117. This if my code for PdfBOx

PDFTextStripper stripper = new PDFTextStripper();
BUFFER.append(stripper.getText(PDDocument.load(pdf)));

And this is for iText

PdfReader reader = new PdfReader(pdf);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
  BUFFER.append(PdfTextExtractor.getTextFromPage(reader, i));
}

My question is in what the performance depends, is there a way how to make PdfBox faster? Or only to use iText? And can you explain more about how strategies affect performance?

like image 586
meilechh Avatar asked Mar 12 '14 02:03

meilechh


People also ask

Is PDFBox free for commercial use?

Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative ...

Is PDFBox thread safe?

Is PDFBox thread safe? No! Only one thread may access a single document at a time. You can have multiple threads each accessing their own PDDocument object.

What is PDFBox used for?

Apache PDFBox is an open source Java library that can be used to create, render, print, split, merge, alter, verify and extract text and meta-data of PDF files.

Is iText free for commercial use?

This license is a commercial license. You have to pay for it. To answer your question: iText can be used for free in situations where you also distribute your software for free. As soon as you want to use iText in a closed source, proprietary environment, you have to pay for your use of iText.


1 Answers

My question is in what the performance depends, is there a way how to make PdfBox faster?

One major difference is that PDFBox always processes text glyph by glyph while iText normally processes it chunk (i.e. single string parameter of text drawing operation) by chunk; that reduces the required resources in iText quite a lot. Furthermore the event oriented architecture of iText text parsing means a lower burden on resources than that of PDFBox. And PDFBox keeps information not strictly required for plain text extraction available for a longer time, costing more resources.

But the way the libraries initially load the document may also make a difference. Here you can experiment a bit, PDFBox not only offers multiple PDDocument.load overloads but also some PDDocument.loadNonSeq overloads (actually PDDocument.loadNonSeq reads documents correctly while PDDocument.load can be tricked to misinterpret PDFs). All these different variants may have different runtime behavior.

more about how strategies affect performance?

iText brings along a simple and a more advanced text extraction strategy. The simple one assumes text in the page content stream to appear in reading order while the more advanced one sorts. By default the more advanced one is used. Thus, you probably can speed up iText even some more by using the simple strategy. PDFBox always sorts.

like image 93
mkl Avatar answered Sep 30 '22 12:09

mkl