
Generate large file and send it

I have a rather large .csv file (up to 1 million lines) that I want to generate and send when a browser requests it.

The current code I have is as follows (except that I don't actually generate the same data):

import random

import tornado.ioloop
import tornado.web

class CSVHandler(tornado.web.RequestHandler):
    def get(self):
        self.set_header('Content-Type', 'text/csv')
        self.set_header('Content-Disposition', 'attachment; filename=dump.csv')
        self.write('lineNumber,measure\r\n')  # file header
        for line in range(1000000):
            # mock data; the real code generates different content
            self.write(','.join([str(line), str(random.random())]) + '\r\n')

app = tornado.web.Application([(r"/csv", CSVHandler)])
app.listen(8080)
tornado.ioloop.IOLoop.instance().start()

The problems I have with the method above are:

  • The browser doesn't start downloading chunks as they are sent; it hangs while the web server appears to prepare the whole response.
  • The web server is blocked while it processes this request, which makes other clients hang.
Asked Jan 09 '23 by Christopher Chiche


1 Answer

By default, all data is buffered in memory until the end of the request so that it can be replaced with an error page if an exception occurs. To send a response incrementally, your handler must be asynchronous (so it can be interleaved with both the writing of the response and other requests on the IOLoop) and use the RequestHandler.flush() method.

Note that "being asynchronous" is not the same as "using the @tornado.web.asynchronous decorator"; in this case I recommend using @tornado.gen.coroutine instead of @asynchronous. This allows you to simply use the yield operator with every flush:

import random

import tornado.gen
import tornado.web

class CSVHandler(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        self.set_header('Content-Type', 'text/csv')
        self.set_header('Content-Disposition', 'attachment; filename=dump.csv')
        self.write('lineNumber,measure\r\n')  # file header
        for line in range(1000000):
            self.write(','.join([str(line), str(random.random())]) + '\r\n')  # mock data
            yield self.flush()  # let this chunk reach the client before generating more

self.flush() starts the process of writing the data to the network, and yield waits until that data has reached the kernel. This lets other handlers run and also helps manage memory consumption (by limiting how far ahead of the client's download speed you can get). Flushing after every line of a CSV file is a little expensive, so you may want to only flush after every 100 or 1000 lines.
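For example, here is a minimal variation of the handler above that flushes periodically instead of on every line; the handler name and the batch size of 1000 are illustrative choices, not anything prescribed by Tornado:

import random

import tornado.gen
import tornado.web

class BatchedCSVHandler(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        self.set_header('Content-Type', 'text/csv')
        self.set_header('Content-Disposition', 'attachment; filename=dump.csv')
        self.write('lineNumber,measure\r\n')
        for line in range(1000000):
            self.write(','.join([str(line), str(random.random())]) + '\r\n')
            if line % 1000 == 999:  # flush every 1000 lines, not every line
                yield self.flush()
        yield self.flush()  # push out whatever is left in the buffer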

Note that if there is an exception once the download has started, there is no way to show an error page to the client; you can only cut the download off partway through. Try to validate the request and do everything that is likely to fail before the first call to flush().
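As a sketch of that structure (the count query argument and its validation are hypothetical, added here only for illustration), do everything that can raise before the streaming loop:

import random

import tornado.gen
import tornado.web

class ValidatedCSVHandler(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        # Validate before anything is flushed: an HTTPError raised here
        # still produces a normal error page for the client.
        count = int(self.get_argument('count', '1000000'))  # hypothetical parameter
        if count <= 0:
            raise tornado.web.HTTPError(400)
        self.set_header('Content-Type', 'text/csv')
        self.set_header('Content-Disposition', 'attachment; filename=dump.csv')
        self.write('lineNumber,measure\r\n')
        # After the first flush() below, an exception can only cut the
        # download off partway; no error page can be shown any more.
        for line in range(count):
            self.write(','.join([str(line), str(random.random())]) + '\r\n')
            yield self.flush()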

Answered Jan 14 '23 by Ben Darnell