I'm trying to convert the first page of a PDF file to an image using PDFBox. When I load a large PDF file, I get an exception.
Code:
    PDDocument doc;
    try {
        InputStream input = new URL("http://www.jewishfederations.org/local_includes/downloads/39497.pdf").openStream();
        doc = PDDocument.load(input);
        PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
        BufferedImage image = firstPage.convertToImage();
        File outputfile = new File("image2.png");
        ImageIO.write(image, "png", outputfile);
        input.close();
        doc.close();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
Exception:
org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 72435 is wrong. Fall back to reading stream until 'endstream'.
org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72435 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:554)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
at Worker.main(Worker.java:27)
Caused by: java.io.IOException: Push back buffer is full
at java.io.PushbackInputStream.unread(Unknown Source)
at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:550)
... 5 more
One major difference is that PDFBox always processes text glyph by glyph while iText normally processes it chunk (i.e. single string parameter of text drawing operation) by chunk; that reduces the required resources in iText quite a lot.
Is PDFBox thread safe? No! Only one thread may access a single document at a time.
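The exception message itself points at one fix: enlarging the push back buffer through the system property org.apache.pdfbox.baseParser.pushBackSize. Below is a minimal sketch of setting it from code. The property name is copied from the message; the class name, the helper method, and the value 1000000 are made up for illustration, and it is an assumption here that the property has to be set before PDDocument.load is called.

```java
class PushBackSizeConfig {
    // Property name taken verbatim from the exception message.
    static final String PUSHBACK_PROP = "org.apache.pdfbox.baseParser.pushBackSize";

    // Hypothetical helper: sets the property and returns the value
    // as Integer.getInteger reads it back.
    static int raisePushBackSize(int bytes) {
        System.setProperty(PUSHBACK_PROP, Integer.toString(bytes));
        return Integer.getInteger(PUSHBACK_PROP);
    }

    public static void main(String[] args) {
        // Arbitrary illustrative size; it only needs to exceed the
        // stream length reported in the warning (72435 in this case).
        int size = raisePushBackSize(1000000);
        System.out.println(PUSHBACK_PROP + " = " + size);
        // The call to PDDocument.load(input) would follow here.
    }
}
```

The same property can be passed on the command line with -Dorg.apache.pdfbox.baseParser.pushBackSize=1000000, which avoids any question of ordering inside the program.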
An alternative solution for the 1.8.* PDFBox versions is to use the non-sequential parser. In that case, instead of
    doc = PDDocument.load(input);
use
    doc = PDDocument.loadNonSeq(input, null);
That parser (which will be the only one in the upcoming 2.0 version) does not depend on the size of a pushback buffer.
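To see where the "Push back buffer is full" line at the bottom of the trace comes from, the failure can be reproduced with java.io.PushbackInputStream alone, without PDFBox. This is only a sketch of the underlying mechanism: the class and method names are invented, and the buffer sizes 65536 and 100000 are illustrative values, not confirmed PDFBox settings.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

class PushbackDemo {
    // Reads `count` bytes, then tries to push all of them back at once.
    // Returns null on success, or the exception message on failure.
    static String readThenUnread(int bufferSize, int count) {
        byte[] data = new byte[count];
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(data), bufferSize)) {
            byte[] chunk = new byte[count];
            int n = in.read(chunk, 0, count);
            // Fails if n is larger than the free space in the push back buffer.
            in.unread(chunk, 0, n);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // 72435 is the stream length from the warning in the question.
        System.out.println(readThenUnread(65536, 72435));
        System.out.println(readThenUnread(100000, 72435));
    }
}
```

The first call fails because more bytes are pushed back than the buffer can hold; the second succeeds because the buffer is larger than the data being re-read. A parser that never needs to push back a whole stream does not hit this limit at all.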