
PDFBox: working with very large PDFs.

Tags:

java

pdfbox

I am working with some very large PDFs, some over 7GB in size. The PDFs have up to 20,000 pages and many full-page color images. I'd like to use PDFBox to work with the PDFs, but because of their size I get an OutOfMemoryError when I attempt to open them.

I'm using pdfbox-app-1.6.0 on Windows 7 with IntelliJ and Java 6.

First I tried writing a simple program that just opened the PDF in a PDDocument and copied each page over to another PDDocument: http://ideone.com/arKhB

Next I tried using the PDFBox CopyDoc example.

Both examples run out of memory.

I'm assuming this is because PDFBox is trying to read the whole document into memory. Is there a way to have it open only one page at a time? I know processing would be slower, but at the moment I can't process anything.

Pengo asked Jul 02 '12 22:07


1 Answer

In the 2.0.* versions, open the PDF like this:

PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());

This makes PDFBox buffer its data in temporary files only (no main memory), with no limit on their size.
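For context, here is a minimal sketch of how that call might sit in a small program, assuming PDFBox 2.0.x on the classpath and Java 7 or later for try-with-resources. The class name LargePdfOpen, both method names, and the memory cap passed to the second method are illustrative and not from the original answer. The second method uses MemoryUsageSetting.setupMixed, a related 2.0 factory method that keeps up to a given number of bytes in main memory before spilling to temporary files; whether either setting is enough for a multi-gigabyte file depends on what is done with the document afterwards.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;

public class LargePdfOpen {

    /** Opens the PDF with buffers held in temporary files only and returns its page count. */
    static int countPages(File file) throws IOException {
        // try-with-resources closes the document and releases its temporary files
        try (PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly())) {
            return doc.getNumberOfPages();
        }
    }

    /** Variant: buffer up to maxMainMemoryBytes in main memory, then spill to temporary files. */
    static int countPagesMixed(File file, long maxMainMemoryBytes) throws IOException {
        try (PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupMixed(maxMainMemoryBytes))) {
            return doc.getNumberOfPages();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = new File(args[0]); // path to the PDF, supplied by the caller
        System.out.println("Pages: " + countPages(file));
    }
}
```

Temporary files go to the JVM's default temporary directory unless configured otherwise, so that location needs enough free disk space.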

Update 17.4.2018: The PDFBox FAQ describes more ways to save memory. One that it does not describe yet, but that has been available since 2.0.9, is subsampling (skipping pixel rows and columns) during rendering, enabled with PDFRenderer.setSubsamplingAllowed(true). This saves memory for PDF files that contain very large images.
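To give a rough sense of why skipping rows and columns helps, the following back-of-the-envelope sketch estimates the size of a decoded image buffer with and without subsampling. It does not call PDFBox. The image dimensions, the 4 bytes per pixel, and the subsampling factor are assumptions chosen for illustration, and the factor PDFBox actually picks for a given image and render scale is not modeled here.

```java
public class SubsamplingEstimate {

    /** Bytes needed for a decoded raster at the given size and bytes per pixel. */
    static long rasterBytes(long widthPx, long heightPx, int bytesPerPixel) {
        return widthPx * heightPx * bytesPerPixel;
    }

    /** Bytes needed when only every factor-th pixel is kept in each direction. */
    static long subsampledBytes(long widthPx, long heightPx, int bytesPerPixel, int factor) {
        long w = (widthPx + factor - 1) / factor; // ceiling division
        long h = (heightPx + factor - 1) / factor;
        return rasterBytes(w, h, bytesPerPixel);
    }

    public static void main(String[] args) {
        // Hypothetical 10000 x 14000 pixel image, 4 bytes per pixel (assumed)
        long full = rasterBytes(10000, 14000, 4);
        long reduced = subsampledBytes(10000, 14000, 4, 4);
        System.out.println("Full decode:  " + full + " bytes");    // 560000000
        System.out.println("Subsampled:   " + reduced + " bytes"); // 35000000
    }
}
```

Because the factor applies in both directions, a factor of n cuts the buffer by roughly n squared, here from about 560 MB to about 35 MB.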

Tilman Hausherr answered Sep 19 '22 02:09