Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache POI Streaming (SXSSF) for Reading

Tags:

I need to read large excel files and import their data to my application.

Since POI takes up a large amount of heap to work, often throwing OutOfMemory errors, I found out that there is a Streaming API for handling excel data in a serial fashion (rather than loading the file completely into memory)

I created a xlsx workbook, with a single worksheet, and typed in several values in cells and came up with the following code to attempt reading it:

public static void main(String[] args) throws Throwable {     // keep 100 rows in memory, exceeding rows will be flushed to disk     SXSSFWorkbook wb = new SXSSFWorkbook(new XSSFWorkbook(new FileInputStream("C:\\test\\tst.xlsx")));     SXSSFSheet sheet = (SXSSFSheet) wb.getSheetAt(0);     Row row = sheet.getRow(0);     //row is always null     while(row.iterator().hasNext()){ //-> NullPointerException         System.out.println(row.getCell(0).getStringCellValue());     } } 

However, despite being able to get its worksheets properly, it always comes with empty (null) rows.

I have researched and found out several examples of the Streaming API in the internet, but none of them are about reading existing files, they're all about generating excel files.

Is it actually possible to read data from existing .xlsx files in a stream?

like image 377
bruno_cw Avatar asked Nov 18 '15 17:11

bruno_cw


People also ask

What is difference between HSSF and XSSF?

HSSF is the POI Project's pure Java implementation of the Excel '97(-2007) file format. XSSF is the POI Project's pure Java implementation of the Excel 2007 OOXML (. xlsx) file format. HSSF and XSSF provides ways to read spreadsheets create, modify, read and write XLS spreadsheets.

Can HSSF read XLSX file?

According to my research, HSSF is used to read xls and XSSF to read xlsx.

What is Sxssf workbook?

public class SXSSFWorkbook extends java.lang.Object implements Workbook. Streaming version of XSSFWorkbook implementing the "BigGridDemo" strategy. This allows to write very large files without running out of memory as only a configurable portion of the rows are kept in memory at any one time.

How do I use XSSFWorkbook?

XSSFWorkbook(java.io.InputStream is) Constructs an XSSFWorkbook object, by buffering the whole input stream into memory and then opening an OPCPackage object for it.


1 Answers

After digging up some more, I found out this library:

If you've used Apache POI in the past to read in Excel files, you probably noticed that it's not very memory efficient. Reading in an entire workbook will cause a severe memory usage spike, which can wreak havoc on a server.

There are plenty of good reasons for why Apache has to read in the whole workbook, but most of them have to do with the fact that the library allows you to read and write with random addresses. If (and only if) you just want to read the contents of an Excel file in a fast and memory effecient way, you probably don't need this ability. Unfortunately, the only thing in the POI library for reading a streaming workbook requires your code to use a SAX-like parser. All of the friendly classes like Row and Cell are missing from that API.

This library serves as a wrapper around that streaming API while preserving the syntax of the standard POI API. Read on to see if it's right for you.

InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx")); StreamingReader reader = StreamingReader.builder()         .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)         .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)         .sheetIndex(0)        // index of sheet to use (defaults to 0)         .sheetName("sheet1")  // name of sheet to use (overrides sheetIndex)         .read(is);            // InputStream or File for XLSX file (required) 

There is also SAX Event API, which reads the document and parse its contents through events.

If memory footprint is an issue, then for XSSF, you can get at the underlying XML data, and process it yourself. This is intended for intermediate developers who are willing to learn a little bit of low level structure of .xlsx files, and who are happy processing XML in java. Its relatively simple to use, but requires a basic understanding of the file structure. The advantage provided is that you can read a XLSX file with a relatively small memory footprint.

like image 165
bruno_cw Avatar answered Oct 07 '22 22:10

bruno_cw