
How to generate pdf documents page-by-page in background tasks on App Engine



I need to generate PDF documents of 100+ pages. Generating them requires processing a lot of data, and all-at-once generation takes more time and memory than I can give.

I have tried a few different methods to hack my way through:

  • xhtml2pdf with HTML generation and conversion
  • reportlab to generate some pages and
  • pyPdf for merging

I got it working, with varying results, but it is slow and takes more memory than it should (sometimes hitting the instance soft memory limit). Currently I generate some sections in separate tasks, store each one in the blobstore, and merge them with pyPdf, but it chokes on larger documents.
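To make the current section-per-task split concrete, here is a minimal sketch of the partitioning step only. It assumes the page count is known up front; `plan_page_chunks` and `pages_per_task` are hypothetical names, not part of reportlab, pyPdf, or any GAE API. It does not generate or merge any PDF data.

```python
def plan_page_chunks(total_pages, pages_per_task):
    """Split pages 1..total_pages into consecutive (first, last) ranges.

    Each range is meant to be handled by one background task; the ranges
    are returned in document order so the sections can be merged in order.
    Hypothetical helper for illustration only.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if pages_per_task < 1:
        raise ValueError("pages_per_task must be at least 1")

    chunks = []
    first = 1
    while first <= total_pages:
        # The last chunk may be shorter than pages_per_task.
        last = min(first + pages_per_task - 1, total_pages)
        chunks.append((first, last))
        first = last + 1
    return chunks


if __name__ == "__main__":
    # One (first, last) tuple per task, in merge order.
    for first, last in plan_page_chunks(120, 25):
        print("task for pages %d-%d" % (first, last))
```

Each tuple would be the payload of one task; the merge step is the part that currently runs out of memory.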

The document I'm generating is not that complicated: mostly tables and text, no internal references, no TOC, nothing that needs to be aware of the rest of the document. I can live with platypus for layout, and I do not need a fancy document look or HTML-to-PDF conversion.

The goal is to generate the document as fast as the datastore will allow. Parallel page generation would be nice but is not required.

I was thinking of page-by-page generation with the blobstore Files API, where each task would generate a single page and the last task would finalize the blobstore file, making it readable. But I can't seem to find out how to pause generation, store the partial PDF to a stream, and then resume generation with that stream to generate the next page in a different task.

So my question is:

How can I generate a PDF document of more than a few pages on GAE, splitting the generation between task requests, and then store the resulting document in the blobstore?

If splitting the generation is not possible with reportlab, then how can I minimize the footprint of merging different PDF documents so that it fits within the limits set for a GAE task request?

UPDATE: Alternatives to the Conversion API are much appreciated.

2nd UPDATE: The Conversion API is being decommissioned, so that's not an option now.

3rd UPDATE: Can the Pipeline or MapReduce APIs help here?