Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SXSSFWorkbook.write to FileOutputStream writes huge files

I'm trying to use SXSSFWorkbook to write an Excel spreadsheet from scratch.

      wb = SXSSFWorkbook(500)
      wb.isCompressTempFiles = true
      sh = streamingWorkbook.createSheet(t.getMessage("template.sheet.name"))

All is fine but when I invoke the final code:

    val out = FileOutputStream(localPath)
    wb.write(out)
    out.close()
    // dispose of temporary files backing this workbook on disk
    wb.dispose()

I get a huge excel file instead of the zipped XLSX I'm expecting. I've tried to manually zip the file and from a 120MB file I can get it down to 9MB. So what am I missing?

Using: Kotlin and

    implementation group: 'org.apache.poi', name: 'poi-ooxml', version: '4.1.2'  // For `.xlsx` files

-- Update 1

I was under the impression that xlsx are zipped files containing xml data in essence [1]. What POI outputs through both XSSFWorkbook and SXSSFWorkbook can be compressed at least by an order of magnitude of 10. I've used this simple code to demonstrate:

fun main() {
  val workbook = XSSFWorkbook()
  writeRowsAndSave(workbook, "test.xlsx")
  workbook.close()

  val streamingWorkbook = SXSSFWorkbook(IN_MEMORY_ROWS_WINDOW_SIZE)
  streamingWorkbook.isCompressTempFiles = true
  writeRowsAndSave(streamingWorkbook, "test-streaming.xlsx")
  streamingWorkbook.dispose()
}

private fun writeRowsAndSave(workbook: Workbook, fileName: String) {
  val ROWS_COUNT = 2_000
  val COLS_COUNT = 1_000

  val sheet = workbook.createSheet("Test Sheet 1")
  for (i in 1..ROWS_COUNT) {
    val row = sheet.createRow(i)
    println("Row $i")
    for(j in 1..COLS_COUNT) {
        row.createCell(j).setCellValue("Test $i")
    }
  }

  FileOutputStream("./$fileName").use {
      workbook.write(it)
  }
}

This produces 5MB-each files that when compressed have roughly 439KB (?!).

like image 724
Mircea D. Avatar asked Mar 31 '20 14:03

Mircea D.


People also ask

What is rowAccessWindowSize?

The rowAccessWindowSize sets the count of rows that kept in memory before they will flushed into the temporary files. All rows that are written already to the temporary sheet file cannot be accessed anymore later because they are not more in memory but only in the temporary file.


1 Answers

SXSSFWorkbook defaults to using inline strings instead of a shared strings table. That means SXSSFWorkbook writes text directly in the sheet even if it is multiple times the same text. XSSFWorkbook and Excel's GUI both are using a shared strings table where text gets an index and the same text is stored only once and the index is used in the sheet then. But that should not have that big impact to the file size of the resulting *.xlsx.

SXSSFWorkbook, as well as all other Office Open XML formatted files, apache poi creates, were zipped using org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream. That uses deflate as compression algorithm and Deflater.DEFAULT_COMPRESSION as the default compression level. One could overwrite protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out) of SXSSFWorkbook to set another compression level. But that also should not have that big impact to the file size of the resulting *.xlsx.

Example Java code:

import java.io.File;
import java.io.OutputStream;
import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

import org.apache.commons.compress.archivers.zip.Zip64Mode;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
import java.util.zip.Deflater;

class CreateSXSSFDifferentCompression {

 static SXSSFWorkbook createSXSSFWorkbook(int compressionLevel, int rowAccessWindowSize, 
                                          boolean compressTmpFiles, boolean useSharedStringsTable) {
  SXSSFWorkbook workbook = null;
  if (compressionLevel != Deflater.DEFAULT_COMPRESSION) {
   workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable) {
    protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out) {
     ZipArchiveOutputStream zos = new ZipArchiveOutputStream(out);
     zos.setUseZip64(Zip64Mode.AsNeeded);  
     zos.setLevel(compressionLevel);
     return zos;
    }    
   }; 
  } else {
   workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable);
  }
  return workbook;
 }

 public static void main(String[] args) throws Exception {

  SXSSFWorkbook workbook = null;

  // uses Deflater.DEFAULT_COMPRESSION and inline strings
  //workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, false); 

  // uses Deflater.DEFAULT_COMPRESSION and shared strings table
  //workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, true); 

  // uses Deflater.BEST_COMPRESSION and inline strings
  workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, false); 

  // uses Deflater.BEST_COMPRESSION and shared strings table
  //workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, true); 

  int ROWS_COUNT = 2000;
  int COLS_COUNT = 1000;

  Sheet sheet = workbook.createSheet("Test Sheet 1");
  for (int i = 1 ; i <= ROWS_COUNT; i++) {
   Row row = sheet.createRow(i);
   //System.out.println("Row " + i);
   for(int j = 1; j <= COLS_COUNT; j++) {
    row.createCell(j).setCellValue("Test " + i);
   }
  }

  FileOutputStream out = new FileOutputStream("./Excel.xlsx");
  workbook.write(out);
  out.close();
  workbook.close();
  workbook.dispose();

  File file = new File("./Excel.xlsx");
  System.out.println(file.length());

 }
}

This results in Excel.xlsx file size of:

5,031,034 bytes when Deflater.DEFAULT_COMPRESSION and inline strings are used.

4,972,663 bytes when Deflater.DEFAULT_COMPRESSION and shared strings table are used.

4,972,915 bytes when Deflater.BEST_COMPRESSION and inline strings are used.

And 4,966,749 bytes when Deflater.BEST_COMPRESSION and shared strings table are used.

Used: Java 12, apache poi 4.1.2, Ubuntu Linux.

Neither I would call that huge for a spreadsheet of 2,000 rows x 1,000 columns nor I would call the impact of the different settings big.

And the entries are compressed very good.

If you look into the Excel.xlsx ZIP archive, you will find the uncompressed size of xl/worksheets/sheet1.xml 112,380,273 bytes when inline strings are used. The uncompressed size of xl/sharedStrings.xml is 138 bytes then and only contains the very basic XML.

If shared strings table is used, then the uncompressed size of xl/worksheets/sheet1.xml is 68,377,273 bytes and the uncompressed size of xl/sharedStrings.xml is 49,045 bytes and contains 2,000 entries.

If Excel itself saves *.xlsx files, it creates files having approximately the same file size when content is equal. So Excel itself uses the same compression level.

Of course one can compressing the *.xlsx files more when storing the Excel.xlsx into a *.zip archive again. But that would not be what Excel expects to be a *.xlsx file.

Microsoft states at What are the benefits of Open XML Formats?:

Compact files Files are automatically compressed and can be up to 75 percent smaller in some cases. The Open XML Format uses zip compression technology to store documents, offering potential cost savings as it reduces the disk space required to store files and decreases the bandwidth needed to send files via e-mail, over networks, and across the Internet. When you open a file, it is automatically unzipped. When you save a file, it is automatically zipped again. You do not have to install any special zip utilities to open and close files in Office.

The important parts here are:

When you open a file, it is automatically unzipped. When you save a file, it is automatically zipped again.

That means, if apache poi would compress the files using other methods or compression levels than Microsoft Office itself, then Microsoft Office would be unable to do so with files apache poi had created.

So, since apache poi creates files which Excel (Microsoft Office) is able opening directly, it uses the same compression method and compression level as Excel (Microsoft Office) will do.

like image 112
Axel Richter Avatar answered Oct 05 '22 19:10

Axel Richter