Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas to excel using XLSX Writer - Constant Memory/In Memory

I am using pandas and xlsxwriter to create excel reports in AWS lambda. I have been having an issue where the disk space fills up when a reasonably large excel file in the region of 60MB is created, but there is 512MB of disk space available.

I have been trying to work out what and why this is happening and I came across a promising looking property I could pass, constant_memory, so I tried this out using the following code:

with pd.ExcelWriter(output, options={"constant_memory": True}) as writer:

It seems this actually ensures my lambda runs to completion consistently and if I remove it then it no longer does, but the issue is I get the following warning:

FutureWarning: Use of **kwargs is deprecated, use engine_kwargs instead.
    with pd.ExcelWriter(output, options={"constant_memory": True}) as writer:

I then read a post on SO, which I think is by the author of xlsxwriter suggesting this property doesn't actually work when using pandas. This is sort of evident in that I only get the first column in my excel report and the last row of 360000 rows. I am confused as to how passing this is ensuring my lambda completes0? And why is it that if I do not pass it then my lambda runs out of disk space?

The second strange thing is the documentation for pandas doesn't mention any argument called options at all, and if I try and pass constant_memory using engine_kwargs:

with pd.ExcelWriter(output, engine_kwargs={'constant_memory': True}) as writer:

I get the following error:

TypeError: __init__() got an unexpected keyword argument 'constant_memory'

I hope someone who is more familiar with xlsxwriter can help me out in understanding these few points.

The documentation for xlsxwriter also mentions a property in_memory, which seems like it could resolve my issue but I am unable to pass this to pd.ExcelWriter:

TypeError: __init__() got an unexpected keyword argument 'in_memory'

**Edit: Passing in_memory does actually solve the issues I have but I get a deprecation warning passing it in using the kwargs options.

with pd.ExcelWriter(output, options={"in_memory": True}) as writer:

FutureWarning: Use of **kwargs is deprecated, use engine_kwargs instead.
    with pd.ExcelWriter(output, options={"in_memory": True}) as writer:

How can I pass this property to the xlsxwriter using pandas without getting the deprecation warning as it seems to resolve my issues?

like image 990
berimbolo Avatar asked Jul 23 '21 21:07

berimbolo


1 Answers

The correct syntax to pass xlsxwriter options to Pandas when you get that deprecation notice is:

writer = pd.ExcelWriter('pandas_example.xlsx',
                        engine='xlsxwriter',
                        engine_kwargs={'options': {'strings_to_numbers': True}})

See this section of the XlsxWriter docs.

However, as you pointed our in your question, and from your observation, the constant_memory option won't work with Pandas since it requires data to be written in a row by row order but Pandas uses a column by column order.

But also, just to be clear, the constant_memory option will only reduce the "memory" used by your application. It won't make the size of the file smaller (in fact it generally makes the file bigger if it contains a lot of string data). It may only by "solving" your issue because it is only writing one column of data to the file, and thus the file is much smaller than intended.

like image 81
jmcnamara Avatar answered Oct 17 '22 05:10

jmcnamara