
PDFbox loading large files

Tags:

java

pdfbox

I'm trying to convert the first page of a PDF file to an image using PDFBox. When I load a large PDF file, I get an exception.

code:

    // try-with-resources closes the download stream even if parsing fails
    try (InputStream input = new URL("http://www.jewishfederations.org/local_includes/downloads/39497.pdf").openStream()) {
        PDDocument doc = PDDocument.load(input);
        // PDFBox 1.8 API: fetch the first page and render it to an image
        PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
        BufferedImage image = firstPage.convertToImage();
        File outputfile = new File("image2.png");
        ImageIO.write(image, "png", outputfile);
        doc.close();

    } catch (IOException e) {
        e.printStackTrace();
    }

exception:

org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 72435 is wrong. Fall back to reading stream until 'endstream'.
org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72435 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:554)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
    at Worker.main(Worker.java:27)
Caused by: java.io.IOException: Push back buffer is full
    at java.io.PushbackInputStream.unread(Unknown Source)
    at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
    at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:550)
    ... 5 more
user2958571 asked Apr 08 '14 19:04





1 Answer

An alternative solution for the 1.8.* PDFBox versions (instead of enlarging the push back buffer, as the exception message suggests) is to use the non-sequential parser. In that case, the code would not be

    doc = PDDocument.load(input);

but

    doc = PDDocument.loadNonSeq(input, null);

That parser (which will be the only one in the upcoming 2.0 version) does not depend on the size of a push back buffer.
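
To illustrate where the "Push back buffer is full" cause in the stack trace comes from, here is a minimal sketch that uses only java.io.PushbackInputStream, with no PDFBox involved. The class name PushbackDemo, the helper canPushBack, and the buffer sizes 65536 and 100000 are assumptions chosen for illustration; only the 72435 byte count is taken from the exception above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class PushbackDemo {

    /**
     * Reads `count` bytes from an in-memory stream and tries to push them
     * all back. Returns false if the push back buffer is too small.
     */
    public static boolean canPushBack(int bufferSize, int count) {
        byte[] data = new byte[count];
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(data), bufferSize)) {
            byte[] chunk = new byte[count];
            int read = in.read(chunk, 0, count);
            // unread() throws IOException("Push back buffer is full")
            // when more bytes are pushed back than the buffer can hold
            in.unread(chunk, 0, read);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 65536 is an assumed buffer size, smaller than the 72435 bytes
        // the parser in the question tried to push back
        System.out.println(canPushBack(65536, 72435));   // prints false
        // a buffer larger than the stream length succeeds
        System.out.println(canPushBack(100000, 72435));  // prints true
    }
}
```

A parser that re-reads by pushing bytes back fails as soon as one stream is longer than its buffer, which is why the exception message suggests raising org.apache.pdfbox.baseParser.pushBackSize. A parser that does not rely on a push back buffer avoids the limit altogether.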

Tilman Hausherr answered Sep 20 '22 20:09
