
How to generate large files (PDF and CSV) using AppEngine and Datastore?

When I first started developing this project, there was no requirement for generating large files; however, it is now a deliverable.

Long story short, GAE just doesn't play nice with any large-scale data manipulation or content generation. The lack of file storage aside, even something as simple as generating a PDF with ReportLab from 1500 records hits a DeadlineExceededError, and this is just a simple PDF consisting of a table.

I am using the following code:

    # imports assumed at module level:
    from reportlab.lib import colors
    from reportlab.lib.pagesizes import landscape, letter
    from reportlab.lib.units import inch
    from reportlab.platypus import SimpleDocTemplate, Table, TableStyle

    self.response.headers['Content-Type'] = 'application/pdf'
    self.response.headers['Content-Disposition'] = 'attachment; filename=output.pdf'
    doc = SimpleDocTemplate(self.response.out, pagesize=landscape(letter))

    elements = []

    dataset = Voter.all().order('addr_str')

    data = [['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE']]

    i = 0
    r = 1
    s = 100

    while ( i < 1500 ):
        voters = dataset.fetch(s, offset=i)
        for voter in voters:
            data.append([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone, voter.firstname+' '+voter.middlename+' '+voter.lastname ])
            r = r + 1
        i = i + s

    t = Table(data, '', r*[0.4*inch], repeatRows=1)
    t.setStyle(TableStyle([('ALIGN',(0,0),(-1,-1),'CENTER'),
                           ('INNERGRID', (0,0), (-1,-1), 0.15, colors.black),
                           ('BOX', (0,0), (-1,-1), .15, colors.black),
                           ('FONTSIZE', (0,0), (-1,-1), 8)
                           ]))

    elements.append(t)

    doc.build(elements) 

Nothing particularly fancy, but it chokes. Is there a better way to do this? If I could write to some kind of file system, generate the file in bits, and then rejoin them, that might work, but I think the platform precludes this.

I need to do the same thing for a CSV file; however, the limit is obviously a bit higher since it's just raw output.

    # imports assumed at module level:
    import csv
    from google.appengine.api import memcache

    self.response.headers['Content-Type'] = 'application/csv'
    self.response.headers['Content-Disposition'] = 'attachment; filename=output.csv'

    dataset = Voter.all().order('addr_str')

    writer = csv.writer(self.response.out,dialect='excel')
    writer.writerow(['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE'])

    i = 0
    s = 100
    while ( i < 2000 ):
        last_cursor = memcache.get('db_cursor')
        if last_cursor:
            dataset.with_cursor(last_cursor)
        voters = dataset.fetch(s)
        for voter in voters:
            writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone, voter.firstname+' '+voter.middlename+' '+voter.lastname])
        memcache.set('db_cursor', dataset.cursor())
        i = i + s
    memcache.delete('db_cursor')

Any suggestions would be very much appreciated.

Edit: possible solutions

Below I have documented three possible solutions based on my research, plus suggestions, etc.

They aren't necessarily mutually exclusive and could be a slight variation or combination of any of the three, but the gist of the solutions is there. Let me know which one you think makes the most sense and might perform the best.

Solution A: Using mapreduce (or tasks), serialize each record and create a memcache entry for each individual record, keyed with the keyname. Then process these items individually into the PDF/XLS file (using get_multi and set_multi).

Solution B: Using tasks, serialize groups of records and store them in the db as blobs. Then, once all records are processed, trigger a task that will load each blob, deserialize it, and write the data into the final file.

Solution C: Using mapreduce, retrieve the keynames and store them as a list, or as a serialized blob. Then load the records by key, which would be faster than the current loading method (a rough sketch of this is below). If I were to do this, which would be better: storing them as a list (and what would the limitations be... I presume a list of 100,000 would be beyond the capabilities of the datastore) or as a serialized blob (or small chunks which I then concatenate or process)?
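
For what it's worth, here is a minimal sketch of the fetch-by-key idea in Solution C, assuming the Voter entities were created with key_names and the names have already been collected (the helper name and batch size are arbitrary):

    def fetch_by_key_names(key_names, batch_size=500):
        # Load Voter entities in batches; a batch get by key skips the
        # query planner entirely, so it's faster than fetch(offset=...).
        voters = []
        for start in range(0, len(key_names), batch_size):
            batch = key_names[start:start + batch_size]
            voters.extend(Voter.get_by_key_name(batch))
        return voters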

Thanks in advance for any advice.

asked Oct 15 '10 by etc


2 Answers

Here is one quick thought, assuming it is crapping out while fetching from the datastore: you could use tasks and cursors to fetch the data in smaller chunks, then do the generation at the end.

Start a task which does the initial query and fetches 300 (an arbitrary number) records, then enqueues a named (important!) task that you pass the cursor to. That one in turn queries [your arbitrary number] records and then passes the cursor to a new named task as well. Continue that until you have enough records.

Within each task, process the entities, then store the serialized result in a text or blob property on a 'processing' model. I would make the model's key_name the same as the name of the task that created it. Keep in mind the serialized data will need to be under the API call size limit.
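
A rough sketch of what that chaining could look like, assuming a hypothetical /export_chunk handler, a made-up ProcessedChunk model for the interim data, and the task-naming scheme described above (on older SDKs the taskqueue module lives under google.appengine.api.labs):

    from google.appengine.api import taskqueue
    from google.appengine.ext import db, webapp

    class ProcessedChunk(db.Model):
        # key_name is the name of the task that created the chunk
        data = db.TextProperty()

    BATCH = 300          # arbitrary; tune to what comfortably fits in one task
    TOTAL_NEEDED = 1500

    class ExportChunkHandler(webapp.RequestHandler):
        def post(self):
            job_id = self.request.get('job_id')
            chunk_no = int(self.request.get('chunk_no', '0'))
            cursor = self.request.get('cursor')

            query = Voter.all().order('addr_str')
            if cursor:
                query.with_cursor(cursor)
            voters = query.fetch(BATCH)

            # Serialize this chunk (assumes the fields are strings) and store it
            # under a predictable key_name so the build step can get it by key.
            rows = [[voter.addr_num, voter.addr_str, voter.phone] for voter in voters]
            serialized = "\x1e".join("\x1f".join(row) for row in rows)
            my_name = '%s-chunk-%d' % (job_id, chunk_no)
            ProcessedChunk(key_name=my_name, data=serialized).put()

            if voters and (chunk_no + 1) * BATCH < TOTAL_NEEDED:
                # Chain the next named task, handing the cursor along.
                taskqueue.add(url='/export_chunk',
                              name='%s-chunk-%d' % (job_id, chunk_no + 1),
                              params={'job_id': job_id,
                                      'chunk_no': chunk_no + 1,
                                      'cursor': query.cursor()})
            else:
                # Enough records: kick off the build step.
                taskqueue.add(url='/export_build',
                              params={'job_id': job_id, 'chunks': chunk_no + 1})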

To serialize your table pretty fast you could use:

    serialized_data = "\x1e".join("\x1f".join(voter) for voter in data)

Have the last task (when you get enough records) kick off the PDF or CSV generation. If you use key_names for your models, you should be able to grab all of the entities with encoded data by key. Fetches by key are pretty fast, and you'll know the models' keys since you know the last task name. Again, you'll want to be mindful of the size of your fetches from the datastore!

To deserialize:

    list(voter.split('\x1f') for voter in serialized_data.split('\x1e'))

Now run your PDF / CSV generation on the data. If splitting up the datastore fetches alone does not help, you'll have to look into doing more of the processing in each task.

Don't forget that in the 'build' task you'll want to raise an exception if any of the interim models are not yet present. Your final task will then automatically retry.
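
Continuing the sketch above (same made-up ProcessedChunk model and chunk-naming scheme), the build step might look roughly like this:

    import csv
    import StringIO

    from google.appengine.ext import webapp

    class ExportBuildHandler(webapp.RequestHandler):
        def post(self):
            job_id = self.request.get('job_id')
            num_chunks = int(self.request.get('chunks'))

            # The key_names are known, so this is a fast batch get by key.
            key_names = ['%s-chunk-%d' % (job_id, n) for n in range(num_chunks)]
            chunks = ProcessedChunk.get_by_key_name(key_names)

            # If any interim chunk is missing, fail so the task queue retries later.
            if any(chunk is None for chunk in chunks):
                raise Exception('not all chunks are ready yet')

            # Deserialize each chunk and write the rows into an in-memory buffer.
            output = StringIO.StringIO()
            writer = csv.writer(output, dialect='excel')
            for chunk in chunks:
                if chunk.data:
                    writer.writerows(row.split('\x1f') for row in chunk.data.split('\x1e'))

            # What you do with output.getvalue() (email it, stash it for
            # download, feed it to ReportLab instead) is up to you.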

answered Sep 27 '22 by Robert Kluin


Some time ago I faced the same problem with GAE. After many attempts I just moved to another web host, since I was able to. Nevertheless, before moving I had two ideas for how to resolve it. I haven't implemented them, but you may want to try.

The first idea is to use an SOA/RESTful service on another server, if that is possible. You can even create another application on GAE in Java and do all the work there (I guess with Java's PDFBox it will take much less time to generate the PDF), then return the result to Python. But this option requires you to know Java, and it also forces you to split your app into several parts with terrible modularity.
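
If you went that route, the hand-off from the Python side could be as simple as a urlfetch call; this is only a sketch, and the service URL and payload format are made up:

    from google.appengine.api import urlfetch

    def render_pdf_remotely(payload):
        # Hand the raw row data to an external rendering service and
        # return the PDF bytes it sends back.
        result = urlfetch.fetch(url='https://pdf-service.example.com/render',  # hypothetical
                                payload=payload,
                                method=urlfetch.POST,
                                headers={'Content-Type': 'text/plain'},
                                deadline=10)
        if result.status_code != 200:
            raise Exception('remote PDF generation failed: %d' % result.status_code)
        return result.content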

So, there's another approach: you can play a 'ping-pong' game with the user's browser. The idea is that if you cannot do everything in a single request, force the browser to send you several. During the first request do only the part of the work that fits within the 30-second limit, then save the state and generate a 'ticket' - a unique identifier for the 'job'. Finally, send the user a response which is a simple page with a redirect back to your app, parametrized with the job ticket. When you get it, just restore the state and proceed with the next part of the job.
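
A rough sketch of that redirect loop, assuming a made-up JobState model to hold the ticket's cursor and partial output (field names, batch size, and the /export URL are all placeholders):

    import uuid
    from google.appengine.ext import db, webapp

    class JobState(db.Model):
        # key_name is the ticket
        cursor = db.TextProperty()
        partial_csv = db.TextProperty(default='')

    BATCH = 300  # rows per request; tune so one pass stays under the deadline

    class ExportHandler(webapp.RequestHandler):
        def get(self):
            ticket = self.request.get('ticket')
            if ticket:
                job = JobState.get_by_key_name(ticket)
            else:
                ticket = uuid.uuid4().hex
                job = JobState(key_name=ticket)

            query = Voter.all().order('addr_str')
            if job.cursor:
                query.with_cursor(job.cursor)
            voters = query.fetch(BATCH)

            # Append this batch to the partial output (header row omitted for brevity;
            # for very large exports you would store chunks rather than one TextProperty).
            rows = ['%s,%s,%s' % (v.addr_num, v.addr_str, v.phone) for v in voters]
            job.partial_csv += '\n'.join(rows) + '\n'
            job.cursor = query.cursor()
            job.put()

            if len(voters) < BATCH:
                # Done: serve the accumulated file.
                self.response.headers['Content-Type'] = 'text/csv'
                self.response.headers['Content-Disposition'] = 'attachment; filename=output.csv'
                self.response.out.write(job.partial_csv)
            else:
                # Not done yet: bounce the browser back with the ticket.
                self.response.out.write(
                    '<meta http-equiv="refresh" content="0;url=/export?ticket=%s">' % ticket)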

answered Sep 27 '22 by ffriend