Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

flask: Writing directly to response stream

Hej all,

I've got an flask restplus server in place and I am trying to achieve to deliver excel as an octet-stream to my clients. It seems that pandas.to_excel(..) consumes a lot of time (30 secs for about 120k rows) when serializing large DataFrames.

Please see below my current implementation:

def format(data_frame):
    # Idea is to directly write to the flask output stream, instead of buffering
    # the whole excel as io.BytesIO. Is there a way to do it?
    output = io.BytesIO()
    writer = pandas.ExcelWriter(output, engine='xlsxwriter')
    data_frame_ordered = data_frame.reindex_axis(sorted(data_frame.columns), axis=1)

    # This consumes a lot of time
    data_frame_ordered.to_excel(writer, sheet_name='ML Data', na_rep=0, index=False, encoding='utf-8')

    # This consumes a lot of time, too.
    writer.save()

    return output.getvalue()


@api.route('/excel', methods=['GET'])
class ExcelResource(Resource):
    def get(self, args):
        # Well, thats a huge pandas.DataFrame
        data_frame = ...
        resp = make_response(format(data_frame))
        resp.headers['Content-Length'] = resp.content_length
        resp.headers['Content-Type'] = 'application/octet-stream'

        return resp

Is there a way to write the excel directly to the flask output stream, without buffering it into an BytesIO instance?

Thanks in advance

Dennis

like image 735
Dennis Avatar asked Jan 26 '17 10:01

Dennis


1 Answers

You can try to create some file-like object that will provide streaming interface to you, like:

import threading
from flask import Response
from Queue import Queue


class StreamWriter(object):
    def __init__(self):
        self.queue = Queue()

    def write(self, some):
        self.queue.put(some)

    def read(self):
        return self.queue.get(True)

    def flush(self):
        pass

    def tell(self):
        #probably some code
        pass

    def seek(self):
        #probably some code
        pass

    def close(self):
        self.queue.put(None)

@api.route('/excel', methods=['GET'])
class ExcelResource(Resource):
    def get(self, args):
        def generate():
            output = StreamWriter()

            def do_stuff():
                output = StreamWriter()

                writer = pandas.ExcelWriter(output, engine='xlsxwriter')
                data_frame_ordered = data_frame.reindex_axis(sorted(data_frame.columns), axis=1)

                # This consumes a lot of time
                data_frame_ordered.to_excel(writer, sheet_name='ML Data', na_rep=0, index=False, encoding='utf-8')

                # This consumes a lot of time, too.
                writer.save()
                output.close()

            threading.Thread(target=do_stuff).start()
            while True:
                chunk = output.read()
                if chunk is None:
                    break
                yield chunk

        return Response(generate(), headers={some_headers})

It's just a rough idea, and this code is untested!

like image 132
Alex Pertsev Avatar answered Sep 20 '22 08:09

Alex Pertsev