Hi all,
I've got a Flask-RESTPlus server in place and I'm trying to deliver Excel files as an octet-stream to my clients. It seems that pandas' to_excel(..) takes a lot of time (about 30 seconds for roughly 120k rows) when serializing large DataFrames.
Please see my current implementation below:
import io

import pandas
from flask import make_response
from flask_restplus import Resource


def format(data_frame):
    # Idea is to write directly to the Flask output stream, instead of buffering
    # the whole Excel file in an io.BytesIO. Is there a way to do it?
    output = io.BytesIO()
    writer = pandas.ExcelWriter(output, engine='xlsxwriter')
    data_frame_ordered = data_frame.reindex_axis(sorted(data_frame.columns), axis=1)
    # This consumes a lot of time
    data_frame_ordered.to_excel(writer, sheet_name='ML Data', na_rep=0, index=False, encoding='utf-8')
    # This consumes a lot of time, too.
    writer.save()
    return output.getvalue()


@api.route('/excel', methods=['GET'])
class ExcelResource(Resource):
    def get(self, args):
        # Well, that's a huge pandas.DataFrame
        data_frame = ...
        resp = make_response(format(data_frame))
        resp.headers['Content-Length'] = resp.content_length
        resp.headers['Content-Type'] = 'application/octet-stream'
        return resp
Is there a way to write the Excel file directly to the Flask output stream, without buffering it in a BytesIO instance?
Thanks in advance
Dennis
You can try creating a file-like object that provides a streaming interface, for example:
import threading

from flask import Response
from queue import Queue  # `from Queue import Queue` on Python 2


class StreamWriter(object):
    # File-like object: write() feeds a queue that read() drains.

    def __init__(self):
        self.queue = Queue()

    def write(self, some):
        self.queue.put(some)

    def read(self):
        return self.queue.get(True)

    def flush(self):
        pass

    def tell(self):
        # probably some code
        pass

    def seek(self):
        # probably some code
        pass

    def close(self):
        # A None sentinel tells the reader that the writer is done.
        self.queue.put(None)


@api.route('/excel', methods=['GET'])
class ExcelResource(Resource):
    def get(self, args):
        def generate():
            output = StreamWriter()

            def do_stuff():
                # Write into the same queue-backed stream the generator reads from.
                writer = pandas.ExcelWriter(output, engine='xlsxwriter')
                data_frame_ordered = data_frame.reindex_axis(sorted(data_frame.columns), axis=1)
                # This consumes a lot of time
                data_frame_ordered.to_excel(writer, sheet_name='ML Data', na_rep=0, index=False, encoding='utf-8')
                # This consumes a lot of time, too.
                writer.save()
                output.close()

            # Produce the Excel data in a background thread while we stream it out.
            threading.Thread(target=do_stuff).start()

            while True:
                chunk = output.read()
                if chunk is None:
                    break
                yield chunk

        return Response(generate(), headers={some_headers})
It's just a rough idea, and this code is untested!
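For completeness, here is a minimal sketch of the headers one might pass in place of some_headers above. The helper name excel_response, the filename, and the MIME type are illustrative assumptions, not part of the original answer; Content-Length is deliberately left out, since the total size isn't known while streaming:

from flask import Response


def excel_response(generator):
    # Hypothetical helper: wraps the generate() generator from the answer above.
    # Filename and MIME type are assumptions for illustration only.
    headers = {
        'Content-Type': 'application/octet-stream',
        'Content-Disposition': 'attachment; filename="ml_data.xlsx"',
        # No Content-Length header: the size is unknown while streaming.
    }
    return Response(generator, headers=headers)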