I read data from a database and generate an HTML DOM from it. The data volume is huge, so it cannot fit in memory at once, but it can be provided chunk by chunk.
I would like to transform the resulting HTML into PDF using Flying Saucer:
import java.io.IOException;
import java.io.OutputStream;

import org.apache.commons.io.IOUtils;
import org.dom4j.Document;
import org.dom4j.DocumentFactory;
import org.dom4j.Element;
import org.dom4j.io.DOMWriter;
import org.xhtmlrenderer.pdf.ITextRenderer;

import com.lowagie.text.DocumentException;

OutputStream bodyStream = outputMessage.getBody();
ITextRenderer renderer = new ITextRenderer();
DocumentFactory documentFactory = DocumentFactory.getInstance();
DOMWriter domWriter = new DOMWriter();
Element htmlNode = documentFactory.createElement("html");
Document htmlDocument = documentFactory.createDocument(htmlNode);
int currentLine = 1;
int currentPage = 1;
try {
    while (currentLine <= numberOfLines) {
        // Load the next chunk of data into the DOM and lay it out:
        currentLine += loadDataToDOM(documentFactory, htmlNode, currentLine, CHUNK_SIZE);
        renderer.setDocument(domWriter.write(htmlDocument), null);
        renderer.layout();
        if (currentPage == 1) {
            // For the first chunk the PDF writer is created:
            renderer.createPDF(bodyStream, false);
        }
        else {
            // Subsequent chunks are appended to the current PDF writer:
            renderer.writeNextDocument(currentPage);
        }
        currentPage += renderer.getRootBox().getLayer().getPages().size();
    }
    // Finalise the PDF:
    renderer.finishPDF();
}
catch (DocumentException e) {
    throw new IOException(e);
}
catch (org.dom4j.DocumentException e) {
    throw new IOException(e);
}
finally {
    IOUtils.closeQuietly(bodyStream);
}
The problem with this approach is that the last page of a chunk is not necessarily completely filled with data. Is there any way to fill that space? For example, I could imagine an approach that checks whether the last page is completely filled, and if it isn't, discards it (does not write it to the PDF), finds out which data was rendered on that page, and rewinds the position in the database (currentLine in the example). It would be nice if someone could post a complete solution.
As I already mentioned in the comments, you are wasting memory and processing time by creating a PDF from a data source by first creating HTML and then converting the HTML to PDF. You are also introducing plenty of unnecessary complexity.
In your comment, you mention low-level functionality such as moveTo() and lineTo(). It would indeed be madness to draw a table using low-level operations that draw every single line and every single word.
You should use the PdfPTable class. The ArrayToTable example is a very simple POC where the data comes in the form of a List<List<String>>. The code is as simple as this:
// a table with 8 columns that spans the full page width
PdfPTable table = new PdfPTable(8);
table.setWidthPercentage(100);
List<List<String>> dataset = getData();
for (List<String> record : dataset) {
    for (String field : record) {
        // each String becomes one cell; a row completes after every 8 cells
        table.addCell(field);
    }
}
document.add(table);
Of course, you are talking about a huge data set, in which case you may not want to build up the table in memory first and then flush the memory when the table is added to the document. You'll want to add small parts of the table while you are building it. That's what happens in the MemoryTests example. Add this line:
table.setComplete(false);
and you can add the table little by little (in the example: every 10 rows). When you've finished adding cells to the table, you should do this:
table.setComplete(true);
document.add(table);
This will add the final rows.
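A minimal sketch of that incremental pattern, assuming iText 5 (package com.itextpdf.text); with the older iText 2.x / OpenPDF that Flying Saucer bundles, the package would be com.lowagie.text instead. The fetchNextRow() method and the dummy rows it returns are placeholders for your chunked database reads:

import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.List;

import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfPTable;
import com.itextpdf.text.pdf.PdfWriter;

public class IncrementalTable {

    public static void main(String[] args) throws Exception {
        Document document = new Document();
        PdfWriter.getInstance(document, new FileOutputStream("huge_table.pdf"));
        document.open();

        PdfPTable table = new PdfPTable(8);
        table.setWidthPercentage(100);
        // Tell iText the table is not complete yet, so partial adds only flush full rows.
        table.setComplete(false);

        int rowCount = 0;
        List<String> row;
        while ((row = fetchNextRow()) != null) {
            for (String field : row) {
                table.addCell(field);
            }
            // Every 10 rows, hand the buffered rows to the document and free the memory.
            if (++rowCount % 10 == 0) {
                document.add(table);
            }
        }

        // Mark the table complete and add the remaining rows.
        table.setComplete(true);
        document.add(table);

        document.close();
    }

    private static int produced = 0;

    // Hypothetical stand-in for chunked database reads: yields 25 dummy rows.
    private static List<String> fetchNextRow() {
        if (produced >= 25) {
            return null;
        }
        produced++;
        List<String> row = new ArrayList<String>();
        for (int i = 1; i <= 8; i++) {
            row.add("row " + produced + ", col " + i);
        }
        return row;
    }
}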
If you want a table with a repeating header and/or footer, take a look at the tables in this PDF: header_footer_1.pdf. The HeaderFooter1 and HeaderFooter2 examples will show you how it's done.
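As an illustrative fragment only (again assuming iText 5, and building on the table from the sketch above): repeating rows are configured on the PdfPTable itself. The first rows you add become the header; setFooterRows marks how many of those header rows are rendered as a footer instead:

PdfPTable table = new PdfPTable(8);
table.setWidthPercentage(100);
// First add the rows that should repeat: one header row and one footer row.
for (int i = 1; i <= 8; i++) {
    table.addCell("Header " + i);
}
for (int i = 1; i <= 8; i++) {
    table.addCell("Footer " + i);
}
// The first two rows repeat on every page ...
table.setHeaderRows(2);
// ... and the last one of those two is drawn as a footer instead of a header.
table.setFooterRows(1);
// Now add the data cells as usual.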
This is not an answer to the precise question you asked, so if this post is useless I'll delete it.
Since the document is huge, you may well get the best results by emitting the data as LaTeX and then running it through pdflatex.
Advantages:
Disadvantages:
If you are interested in this, I can flesh out more details.
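Purely as an illustration of what such a pipeline could look like (not the answerer's code): a sketch that streams rows into a longtable in a .tex file and then shells out to pdflatex. The file name, the four-column layout, and the fetchNextRow() placeholder are assumptions, and real data would additionally need LaTeX special characters escaped:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.Arrays;
import java.util.List;

public class LatexExport {

    public static void main(String[] args) throws Exception {
        try (BufferedWriter out = new BufferedWriter(new FileWriter("report.tex"))) {
            out.write("\\documentclass{article}\n");
            out.write("\\usepackage{longtable}\n");
            out.write("\\begin{document}\n");
            // longtable breaks across pages, so rows can be streamed to disk
            // without holding the whole table in memory.
            out.write("\\begin{longtable}{llll}\n");
            List<String> row;
            while ((row = fetchNextRow()) != null) {
                out.write(String.join(" & ", row));
                out.write(" \\\\\n");
            }
            out.write("\\end{longtable}\n");
            out.write("\\end{document}\n");
        }
        // Run pdflatex on the generated file; it writes report.pdf alongside it.
        new ProcessBuilder("pdflatex", "-interaction=nonstopmode", "report.tex")
                .inheritIO()
                .start()
                .waitFor();
    }

    private static int produced = 0;

    // Hypothetical stand-in for chunked database reads: yields 5 dummy rows.
    private static List<String> fetchNextRow() {
        return produced++ < 5
                ? Arrays.asList("a" + produced, "b" + produced, "c" + produced, "d" + produced)
                : null;
    }
}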