
Generate big PDF from huge amount of data

I read data from a database and generate an HTML DOM from it. The data volume is too large to fit in memory at once, but it can be provided chunk by chunk.

I would like to transform the resulting HTML into a PDF using Flying Saucer:

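import java.io.IOException;
import java.io.OutputStream;

// Assumed: the iText 2.x package; with the iText 5 variant of Flying Saucer this would be com.itextpdf.text.
import com.lowagie.text.DocumentException;
import org.apache.commons.io.IOUtils;
import org.dom4j.Document;
import org.dom4j.DocumentFactory;
import org.dom4j.Element;
import org.dom4j.io.DOMWriter;
import org.xhtmlrenderer.pdf.ITextRenderer;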

OutputStream bodyStream = outputMessage.getBody();

ITextRenderer renderer = new ITextRenderer();

DocumentFactory documentFactory = DocumentFactory.getInstance();
DOMWriter domWriter = new DOMWriter();

Element htmlNode = documentFactory.createElement("html");
Document htmlDocument = documentFactory.createDocument(htmlNode);

int currentLine = 1;
int currentPage = 1;

try {
    while (currentLine <= numberOfLines) {
        currentLine += loadDataToDOM(documentFactory, htmlNode, currentLine, CHUNK_SIZE);

        renderer.setDocument(domWriter.write(htmlDocument), null);
        renderer.layout();

        if (currentPage == 1) {
            // For the first page the PDF writer is created:
            renderer.createPDF(bodyStream, false);
        }
        else {
            // Other documents are appended to current PDF writer:
            renderer.writeNextDocument(currentPage);
        }

        currentPage += renderer.getRootBox().getLayer().getPages().size();
    }

    // Finalise the PDF:
    renderer.finishPDF();
}
catch (DocumentException e) {
    throw new IOException(e);
}
catch (org.dom4j.DocumentException e) {
    throw new IOException(e);
}
finally {
    IOUtils.closeQuietly(bodyStream);
}

The problem with this approach is that the last page of each chunk is not necessarily completely filled with data. Is there any way to fill that space? For example, I could imagine an approach that checks whether the last page is completely filled and, if it is not, discards it (does not write it to the PDF), then finds out which data was rendered on that page and rewinds the position in the database (currentLine in the example). It would be nice if someone could post a complete solution.

dma_k asked Jun 25 '14 16:06



2 Answers

As I already mentioned in the comments, you are wasting memory and processing time by first generating HTML from the data source and then converting that HTML to PDF. You're also introducing plenty of unnecessary complexity.

In your comment, you mention low-level functionality such as moveTo() and lineTo(). It would indeed be madness to draw a table using low-level operations that draw every single line and every single word.

You should use the PdfPTable class instead. The ArrayToTable example is a very simple proof of concept in which the data comes in the form of a List<List<String>>. The code is as simple as this:

PdfPTable table = new PdfPTable(8);
table.setWidthPercentage(100);
List<List<String>> dataset = getData();
for (List<String> record : dataset) {
    for (String field : record) {
        table.addCell(field);
    }
}
document.add(table);

Of course, you are talking about a huge data set, so you will not want to build the complete table in memory first and only release that memory once the table has been added to the document. Instead, you'll want to add small parts of the table while you are still building it. That's what happens in the MemoryTests example. Add this line:

table.setComplete(false);

You can then add the table to the document little by little (in the example, every 10 rows). When you've finished adding cells to the table, do this:

table.setComplete(true);
document.add(table);

This will add the final rows.
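
To make the control flow of this incremental pattern easier to see, here is a minimal sketch that uses only the JDK. RowSink, IncrementalTableWriter and CountingSink are hypothetical names made up for this illustration, not iText classes; the comments indicate where the PdfPTable calls mentioned above would go in a real implementation.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the PdfPTable/Document pair; not part of iText. */
interface RowSink {
    void addRow(List<String> row); // real code: table.addCell(field) for each field
    void flush();                  // real code: document.add(table) while the table is incomplete
    void finish();                 // real code: table.setComplete(true); document.add(table);
}

class IncrementalTableWriter {
    /** Feeds rows to the sink, flushing every flushEvery rows; returns the number of rows seen. */
    static long write(Iterator<List<String>> rows, RowSink sink, int flushEvery) {
        long count = 0;
        while (rows.hasNext()) {
            sink.addRow(rows.next());
            if (++count % flushEvery == 0) {
                sink.flush(); // hand the pending rows over so they need not stay buffered
            }
        }
        sink.finish(); // write whatever is left after the last full batch
        return count;
    }
}

/** Test sink that only tracks how many rows were pending at most and how many were written. */
class CountingSink implements RowSink {
    private int pending = 0;
    int maxPending = 0;
    long written = 0;

    public void addRow(List<String> row) {
        pending++;
        maxPending = Math.max(maxPending, pending);
    }

    public void flush() {
        written += pending;
        pending = 0;
    }

    public void finish() {
        flush();
    }
}
```

Because the rows arrive through an Iterator, the caller can back it with a database cursor or a chunked query, so that neither the source data nor the table has to be held in memory in full.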

If you want a table with a repeating header and/or footer, take a look at the tables in this PDF: header_footer_1.pdf

The HeaderFooter1 and HeaderFooter2 examples will show you how it's done.

Bruno Lowagie answered Nov 13 '22 00:11



This is not an answer to the precise question you asked, so if this post is useless I'll delete it.

Since the document is huge, you may well get the best results by emitting the data as LaTeX and then running it through pdflatex.

Advantages:

  • LaTeX source of the kind you need is simple to emit - no more complicated than HTML.
  • The whole TeX system is designed to produce beautiful and huge documents. LaTeX is processed as a stream of pages. The number of pages has essentially no effect on RAM resources required.
  • You get the full power of a typesetting language to make your pages look great. Want fancy headers? Nicely positioned page numbers? Section headings? Clickable Table of Contents, etc. etc. No problem.
  • LaTeX is available free for all major operating systems.

Disadvantages:

  • LaTeX runs as a native executable (pdflatex), not as a Java library, so it has to be installed and invoked as an external process.
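
A minimal sketch of the emitting side in Java, under some assumptions of mine: the data is tabular, the longtable package is used so that the table can break across pages, and every column is left-aligned. LatexTableEmitter and its methods are hypothetical names for this illustration. Rows are pulled from an Iterator one at a time, so only the current row is held in memory.

```java
import java.io.PrintWriter;
import java.util.Iterator;
import java.util.List;

/** Streams rows into a LaTeX longtable without holding the whole data set in memory. */
class LatexTableEmitter {

    /** Escapes the characters that LaTeX treats specially in running text. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\textbackslash{}"); break;
                case '~':  sb.append("\\textasciitilde{}"); break;
                case '^':  sb.append("\\textasciicircum{}"); break;
                case '&': case '%': case '$': case '#':
                case '_': case '{': case '}':
                    sb.append('\\').append(c); break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Writes a complete LaTeX document containing one longtable with the given rows. */
    static void emit(PrintWriter out, int columns, Iterator<List<String>> rows) {
        out.println("\\documentclass{article}");
        out.println("\\usepackage{longtable}");
        out.println("\\begin{document}");

        // Column specification: one left-aligned column per field.
        StringBuilder spec = new StringBuilder();
        for (int i = 0; i < columns; i++) {
            spec.append('l');
        }
        out.println("\\begin{longtable}{" + spec + "}");

        // One table row per record; fields are separated by '&' and rows end with '\\'.
        while (rows.hasNext()) {
            List<String> row = rows.next();
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    line.append(" & ");
                }
                line.append(escape(row.get(i)));
            }
            out.println(line + " \\\\");
        }

        out.println("\\end{longtable}");
        out.println("\\end{document}");
    }
}
```

In practice the PrintWriter would wrap a file, and the resulting .tex file would then be compiled by running pdflatex as an external process, for example through ProcessBuilder, assuming pdflatex is on the PATH. With longtable, more than one pdflatex run may be needed before the column widths settle.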

If you are interested in this, I can flesh out more details.

Gene answered Nov 13 '22 01:11
