Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does POI XSSF still have crazy bad memory issues?

A couple years ago, I ran into issues where I was creating large excel files using jXLS and POI XSSF. If my memory is correct, I think XSSF would create something like 1GB+ temp files on the disk to create 10mb excel files. So I stopped using jXLS and instead used SXSSF to create the excel files, but today I have new reasons to use jXLS or JETT.

Both jXLS and JETT websites seem to allude that performance is much better, but POI's XSSF website still says generically that the XSSF requires a higher memory footprint. I am wondering if this higher memory footprint is something like a reasonable 10% overhead these days, or if it is still like the 10,000% overhead as it was a couple years ago.

Are the crazy bad memory issues fixed with POI 3.9 XSSF? Should I not worry about using it with jXLS or JETT? Or are there certain gotchas to avoid? I am careful about reusing cell styles.

like image 826
Bob Thule Avatar asked Jan 31 '14 19:01

Bob Thule


People also ask

What is the difference between HSSF and XSSF?

HSSF is the POI Project's pure Java implementation of the Excel '97(-2007) file format. XSSF is the POI Project's pure Java implementation of the Excel 2007 OOXML (. xlsx) file format. HSSF and XSSF provides ways to read spreadsheets create, modify, read and write XLS spreadsheets.

What is XSSF function?

XSSF (XML Spreadsheet Format) − It is used for xlsx file format of MS-Excel. HPSF (Horrible Property Set Format) − It is used to extract property sets of the MS-Office files. HWPF (Horrible Word Processor Format) − It is used to read and write doc extension files of MS-Word.

What is HSSF in Apache POI?

HSSF It's Stand for “Horrible Spreadsheet Format”. It is used to read and write xls format of MS-Excel files. XSSF It's Stand for “XML Spreadsheet Format”. It is used for xlsx file format of MS-Excel.


1 Answers

To answer your question, yes, POI will always use very large amount of memory when working on large XLSX files, which is much larger than the size of the XLSX files. I don't think this will change anytime soon, and there are pretty obvious reasons for that: XLSX is basically a bunch of zipped XML files, and XML is very well compressed (around 10x). Getting this XML just to sit in memory uncompressed would already increase the memory consumption tenfold, so if you add all the overhead of data structures, there's no way you should expect a 10% increase in memory consumption over the XLSX file size.

Now, the good news is that as mentioned in the comments, Apache POI introduced SXSSF for streaming very large amount of data in a spreadsheet with very good performance and low memory usage. XLSX files generated this way are still streamed on the hard disk where they can end up taking quite a bit of space, but at least you don't risk OOME when writing hundreds of thousands of rows.

The problem for you is that you won't be able to get JETT to directly work with SXSSF, as it needs the whole document loaded in memory for performing template filling. JETT author quickly discussed this topic here.

I had the same problem, and ended up doing a two-step XLSX creation:

  1. A standard JETT XLSX template to generate headers and formatting. The last row of the first sheet contains cells with $$tokens$$, one per cell. I don't use JETT to insert the large amount of rows.

  2. Once JETT did its work, I reopen the workbook, read then delete the $$tokens$$ on the last line of the first spreadsheet, and start streaming data with SXSSF row by row.

Of course, there are limitations to this approach: - You cannot use JETT on any of the streamed rows during rows insertion (but you can before, to dynamically pick the order of the $$tokens$$ for example) - Cells format won't be copied unless you take care of it yourself with POI API. I personally prefer to format whole columns in my XLSX file, and it will apply to the streamed data.

This also works if you want to show charts using data inserted with SXSSF: You can define a Named Range with functions OFFSET and COUNTA, then create a Pivot table & Pivot Chart that will be refreshed when the XLSX is opened in Excel.

like image 84
Etienne C Avatar answered Oct 30 '22 15:10

Etienne C