How to load a large xlsx file with Apache POI?

I have a large .xlsx file (141 MB, containing 293413 rows with 62 columns each) that I need to perform some operations on.

I am having problems with loading this file (OutOfMemoryError), as POI has a large memory footprint on XSSF (xlsx) workbooks.

This SO question is similar, and the solution presented is to increase the VM's allocated/maximum memory.

It seems to work for that kind of file size (9 MB), but for me it simply doesn't work, even if I allocate all available system memory. (That is no surprise, considering my file is over 15 times larger.)

I'd like to know if there is any way to load the workbook so that it won't consume all the memory, and yet without processing (going into) the XSSF's underlying XML. (In other words, keeping a pure POI solution.)

If there isn't, though, you are welcome to say so ("There isn't.") and point me toward an "XML" solution.

XenoRo asked Aug 09 '12 20:08


2 Answers

I was in a similar situation in a webserver environment. The typical size of the uploads was ~150k rows, and it wouldn't have been good to consume a ton of memory for a single request. The Apache POI Streaming API works well for this, but it requires a total redesign of your read logic. I already had a bunch of read logic using the standard API that I didn't want to redo, so I wrote this instead: https://github.com/monitorjbl/excel-streaming-reader

It's not entirely a drop-in replacement for the standard XSSFWorkbook class, but if you're just iterating through rows it behaves similarly:

    import com.monitorjbl.xlsx.StreamingReader;

    InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
    StreamingReader reader = StreamingReader.builder()
            .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
            .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)
            .sheetIndex(0)        // index of sheet to use (defaults to 0)
            .read(is);            // InputStream or File for XLSX file (required)

    for (Row r : reader) {
      for (Cell c : r) {
        System.out.println(c.getStringCellValue());
      }
    }

There are some caveats to using it; due to the way XLSX sheets are structured, not all data is available in the current window of the stream. However, if you're just trying to read simple data out from the cells, it works pretty well for that.
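For anyone curious what the "XML" route mentioned in the question looks like underneath, here is a minimal sketch using only the JDK (no POI). It relies on the fact that an .xlsx file is a zip archive whose worksheets are XML parts, and it pulls one part through a StAX parser so that only the current element is held in memory. The class name `SheetRowCounter` and the entry path `xl/worksheets/sheet1.xml` are assumptions for illustration; the path is the usual location of the first sheet but is not guaranteed for every workbook. It only counts rows, to show the streaming pattern, and is not a substitute for either library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of POI or excel-streaming-reader.
final class SheetRowCounter {

    // Assumption: the first worksheet is stored under this entry name.
    static final String DEFAULT_SHEET_ENTRY = "xl/worksheets/sheet1.xml";

    /** Counts <row> elements in one worksheet part without building a DOM. */
    static long countRows(String xlsxPath, String sheetEntry)
            throws IOException, XMLStreamException {
        try (ZipFile zip = new ZipFile(xlsxPath)) {
            ZipEntry entry = zip.getEntry(sheetEntry);
            if (entry == null) {
                throw new IOException("No such entry: " + sheetEntry);
            }
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Uploaded files are untrusted input: turn off DTDs and external entities.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            try (InputStream in = zip.getInputStream(entry)) {
                XMLStreamReader xml = factory.createXMLStreamReader(in);
                long rows = 0;
                try {
                    // Pull events one at a time; nothing is retained between rows.
                    while (xml.hasNext()) {
                        if (xml.next() == XMLStreamConstants.START_ELEMENT
                                && "row".equals(xml.getLocalName())) {
                            rows++;
                        }
                    }
                } finally {
                    xml.close();
                }
                return rows;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countRows(args[0], DEFAULT_SHEET_ENTRY));
    }
}
```

This also hints at why not everything is available in the current window of the stream: text cells in the sheet part typically hold only an index into a separate shared-strings part (`xl/sharedStrings.xml`), so a real reader has to resolve those separately.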

monitorjbl answered Oct 04 '22 11:10


An improvement in memory usage can be achieved by using a File instead of an InputStream. (It is better to use a streaming API, but the streaming APIs have limitations; see http://poi.apache.org/spreadsheet/index.html)

So instead of

Workbook workbook = WorkbookFactory.create(inputStream); 

do

Workbook workbook = WorkbookFactory.create(new File("yourfile.xlsx")); 

This is according to: http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream

Files vs InputStreams

"When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file."
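If the workbook only arrives as an InputStream (an upload, for example), one way to still take the File route is to spool the stream to a temporary file first. The following is a minimal sketch using only the JDK; the class name `SpoolToFile` and the method `spool` are hypothetical names for illustration, and the temp-file prefix and suffix are arbitrary choices.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of POI.
final class SpoolToFile {

    /**
     * Copies the stream to a temporary .xlsx file and returns its path.
     * The caller is responsible for deleting the file when done with it.
     */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("workbook-", ".xlsx");
        try {
            // Files.copy streams the bytes to disk instead of holding them in memory.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails.
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }
}
```

The result can then be passed to the File overload shown above, e.g. `WorkbookFactory.create(SpoolToFile.spool(inputStream).toFile())`. This trades some disk I/O for not buffering the upload in memory; it does not by itself reduce the memory the parsed XSSF workbook needs.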

rjdkolb answered Oct 04 '22 11:10