I'm trying to use SXSSFWorkbook to write an Excel spreadsheet from scratch.
wb = SXSSFWorkbook(500)
wb.isCompressTempFiles = true
sh = streamingWorkbook.createSheet(t.getMessage("template.sheet.name"))
All is fine but when I invoke the final code:
val out = FileOutputStream(localPath)
wb.write(out)
out.close()
// dispose of temporary files backing this workbook on disk
wb.dispose()
I get a huge excel file instead of the zipped XLSX I'm expecting. I've tried to manually zip the file and from a 120MB file I can get it down to 9MB. So what am I missing?
Using: Kotlin and
implementation group: 'org.apache.poi', name: 'poi-ooxml', version: '4.1.2' // For `.xlsx` files
-- Update 1
I was under the impression that xlsx are zipped files containing xml data in essence [1]. What POI outputs through both XSSFWorkbook and SXSSFWorkbook can be compressed at least by an order of magnitude of 10. I've used this simple code to demonstrate:
fun main() {
val workbook = XSSFWorkbook()
writeRowsAndSave(workbook, "test.xlsx")
workbook.close()
val streamingWorkbook = SXSSFWorkbook(IN_MEMORY_ROWS_WINDOW_SIZE)
streamingWorkbook.isCompressTempFiles = true
writeRowsAndSave(streamingWorkbook, "test-streaming.xlsx")
streamingWorkbook.dispose()
}
private fun writeRowsAndSave(workbook: Workbook, fileName: String) {
val ROWS_COUNT = 2_000
val COLS_COUNT = 1_000
val sheet = workbook.createSheet("Test Sheet 1")
for (i in 1..ROWS_COUNT) {
val row = sheet.createRow(i)
println("Row $i")
for(j in 1..COLS_COUNT) {
row.createCell(j).setCellValue("Test $i")
}
}
FileOutputStream("./$fileName").use {
workbook.write(it)
}
}
This produces 5MB-each files that when compressed have roughly 439KB (?!).
The rowAccessWindowSize sets the count of rows that kept in memory before they will flushed into the temporary files. All rows that are written already to the temporary sheet file cannot be accessed anymore later because they are not more in memory but only in the temporary file.
SXSSFWorkbook
defaults to using inline strings instead of a shared strings table. That means SXSSFWorkbook
writes text directly in the sheet even if it is multiple times the same text. XSSFWorkbook
and Excel's GUI both are using a shared strings table where text gets an index and the same text is stored only once and the index is used in the sheet then. But that should not have that big impact to the file size of the resulting *.xlsx
.
SXSSFWorkbook
, as well as all other Office Open XML
formatted files, apache poi
creates, were zipped using org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream
. That uses deflate as compression algorithm and Deflater.DEFAULT_COMPRESSION
as the default compression level. One could overwrite protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out)
of SXSSFWorkbook
to set another compression level. But that also should not have that big impact to the file size of the resulting *.xlsx
.
Example Java
code:
import java.io.File;
import java.io.OutputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.commons.compress.archivers.zip.Zip64Mode;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
import java.util.zip.Deflater;
class CreateSXSSFDifferentCompression {
static SXSSFWorkbook createSXSSFWorkbook(int compressionLevel, int rowAccessWindowSize,
boolean compressTmpFiles, boolean useSharedStringsTable) {
SXSSFWorkbook workbook = null;
if (compressionLevel != Deflater.DEFAULT_COMPRESSION) {
workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable) {
protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out) {
ZipArchiveOutputStream zos = new ZipArchiveOutputStream(out);
zos.setUseZip64(Zip64Mode.AsNeeded);
zos.setLevel(compressionLevel);
return zos;
}
};
} else {
workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable);
}
return workbook;
}
public static void main(String[] args) throws Exception {
SXSSFWorkbook workbook = null;
// uses Deflater.DEFAULT_COMPRESSION and inline strings
//workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, false);
// uses Deflater.DEFAULT_COMPRESSION and shared strings table
//workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, true);
// uses Deflater.BEST_COMPRESSION and inline strings
workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, false);
// uses Deflater.BEST_COMPRESSION and shared strings table
//workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, true);
int ROWS_COUNT = 2000;
int COLS_COUNT = 1000;
Sheet sheet = workbook.createSheet("Test Sheet 1");
for (int i = 1 ; i <= ROWS_COUNT; i++) {
Row row = sheet.createRow(i);
//System.out.println("Row " + i);
for(int j = 1; j <= COLS_COUNT; j++) {
row.createCell(j).setCellValue("Test " + i);
}
}
FileOutputStream out = new FileOutputStream("./Excel.xlsx");
workbook.write(out);
out.close();
workbook.close();
workbook.dispose();
File file = new File("./Excel.xlsx");
System.out.println(file.length());
}
}
This results in Excel.xlsx
file size of:
5,031,034 bytes when Deflater.DEFAULT_COMPRESSION and inline strings are used.
4,972,663 bytes when Deflater.DEFAULT_COMPRESSION and shared strings table are used.
4,972,915 bytes when Deflater.BEST_COMPRESSION and inline strings are used.
And 4,966,749 bytes when Deflater.BEST_COMPRESSION and shared strings table are used.
Used: Java 12
, apache poi 4.1.2
, Ubuntu Linux
.
Neither I would call that huge for a spreadsheet of 2,000 rows x 1,000 columns nor I would call the impact of the different settings big.
And the entries are compressed very good.
If you look into the Excel.xlsx
ZIP archive, you will find the uncompressed size of xl/worksheets/sheet1.xml
112,380,273 bytes when inline strings are used. The uncompressed size of xl/sharedStrings.xml
is 138 bytes then and only contains the very basic XML.
If shared strings table is used, then the uncompressed size of xl/worksheets/sheet1.xml
is 68,377,273 bytes and the uncompressed size of xl/sharedStrings.xml
is 49,045 bytes and contains 2,000 entries.
If Excel
itself saves *.xlsx
files, it creates files having approximately the same file size when content is equal. So Excel
itself uses the same compression level.
Of course one can compressing the *.xlsx
files more when storing the Excel.xlsx
into a *.zip
archive again. But that would not be what Excel
expects to be a *.xlsx
file.
Microsoft
states at What are the benefits of Open XML Formats?:
Compact files Files are automatically compressed and can be up to 75 percent smaller in some cases. The Open XML Format uses zip compression technology to store documents, offering potential cost savings as it reduces the disk space required to store files and decreases the bandwidth needed to send files via e-mail, over networks, and across the Internet. When you open a file, it is automatically unzipped. When you save a file, it is automatically zipped again. You do not have to install any special zip utilities to open and close files in Office.
The important parts here are:
When you open a file, it is automatically unzipped. When you save a file, it is automatically zipped again.
That means, if apache poi
would compress the files using other methods or compression levels than Microsoft Office
itself, then Microsoft Office
would be unable to do so with files apache poi
had created.
So, since apache poi
creates files which Excel
(Microsoft Office
) is able opening directly, it uses the same compression method and compression level as Excel
(Microsoft Office
) will do.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With