
Reading Big XLS and XLSX files

I'm aware of the existing posts on this topic and have made several attempts to reach my objective, as I elaborate below:

I have a .zip/.rar, that contains multiple xls & xlsx files.

Each Excel file contains dozens up to thousands of rows and around 90 columns, give or take (the number of columns varies from file to file).

I've created a Java WindowBuilder application in which I select a .zip/.rar file, choose where to unzip its files, and create them using FileOutputStream. After each file is saved, I read the file for its content.
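For reference, the unzip step described above could be sketched with `java.util.zip` alone, roughly as follows. The names `ExcelUnzipper` and `extractExcelFiles` are hypothetical, and the sketch handles .zip archives only, since the JDK has no built-in .rar support.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class ExcelUnzipper {
    /** Extracts every .xls/.xlsx entry of the zip stream into targetDir and returns the written paths. */
    static List<Path> extractExcelFiles(InputStream zipData, Path targetDir) throws IOException {
        List<Path> written = new ArrayList<>();
        Path root = targetDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        try (ZipInputStream zin = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                String name = entry.getName().toLowerCase(Locale.ROOT);
                // Skip directories and anything that is not an Excel file.
                if (entry.isDirectory() || !(name.endsWith(".xls") || name.endsWith(".xlsx"))) {
                    continue;
                }
                Path out = root.resolve(entry.getName()).normalize();
                // Guard against entries such as "../evil.xlsx" escaping the target directory.
                if (!out.startsWith(root)) {
                    throw new IOException("Entry outside target directory: " + entry.getName());
                }
                Files.createDirectories(out.getParent());
                // Streams the current entry to disk without holding it all in memory.
                Files.copy(zin, out, StandardCopyOption.REPLACE_EXISTING);
                written.add(out);
            }
        }
        return written;
    }
}
```

Each returned path can then be handed to whatever reads the workbook, one file at a time.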

So far so good. After several attempts to avoid OOM (OutOfMemoryError) and speed things up, I've reached the 'final version' (which is quite awful, but it will do until I figure out how to read things properly), which I will explain:

File file = new File("certainFile.xlsx"); // or .xls; for example purposes
Workbook wb;
Sheet sheet;
/*
There is a ton of other things up to this point that I don't consider relevant, as they're related to unzipping, renaming, etc.
This is within a loop.

In every zip file, there are at least 1 or 2 files that somehow, when they reach
WorkbookFactory.create(), still give an OOM because it recognizes that the sheet has
a bit over a million rows, meaning it's a 2007-format file (according to our friend Google.com), or so I believe.
When I open the xlsx file, it is indeed around 10-20 MB in size, with thousands of empty rows. When I save it again,
it is 1 MB with a couple of thousand rows. After many attempts to read it as an InputStream or a File, or to re-save it
automatically, I settled on converting it to a CSV and reading it differently,
hence this 'solution'. If parseAsXLS is true, my regular logic is applied
per row, per cell; otherwise I parse the CSV.
*/
if (file.getName().contains("xlsx")) {
    this.parseAsXLS = false;
    OPCPackage pkg = OPCPackage.open(file);
    // This just writes the content to a CSV file that I read later on; it gets overwritten every time this code runs
    FileOutputStream fo = new FileOutputStream(this.filePath + File.separator + "excel.csv");
    PrintStream ps = new PrintStream(fo);
    XLSX2CSV xlsxCsvConverter = new XLSX2CSV(pkg, ps, 90);
    try {
        xlsxCsvConverter.process();
    } catch (Exception e) {
        // I've added a counter to the XLSX2CSV class to limit the number of rows I fetch; it throws an exception on purpose
        System.out.println("Limited the file at 60k rows");
    }
} else {
    this.parseAsXLS = true;
    this.wb = WorkbookFactory.create(file);
    this.sheet = wb.getSheetAt(0);
}

What happens now is that one .xlsx file (from a .zip with several other .xls and .xlsx files) has some character in a row that XLSX2CSV treats as endRow, which results in incorrect output.

This is an example: imagelink

Note: The objective is to fetch only a certain set of columns that the Excel files have in common (or might have; they're not mandatory) and put them together in a new Excel file. The email column (which contains multiple emails separated by commas) has what I believe to be an 'enter' (line break) before the email, because if I erase it manually, the problem goes away. However, the objective is to not open every Excel file manually and fix it; otherwise I'd just open every file and copy-paste the columns I need. In that example, I'd require the columns fieldAA, fieldAG, fieldAL and fieldAN.
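Picking a fixed set of columns that a given file may or may not contain can be sketched as a header lookup followed by a projection. This is only an illustration, assuming that the first row of each file holds the column names (such as fieldAA) and that each row has already been read into a list of strings; `ColumnPicker`, `indexesOf` and `pick` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class ColumnPicker {
    /** For each wanted header, returns its index in headerRow, or -1 if this file lacks it. */
    static int[] indexesOf(List<String> headerRow, List<String> wanted) {
        int[] indexes = new int[wanted.size()];
        for (int i = 0; i < indexes.length; i++) {
            indexes[i] = headerRow.indexOf(wanted.get(i));
        }
        return indexes;
    }

    /** Projects one data row onto the wanted columns; missing columns become empty strings. */
    static List<String> pick(List<String> row, int[] indexes) {
        List<String> out = new ArrayList<>(indexes.length);
        for (int index : indexes) {
            out.add(index >= 0 && index < row.size() ? row.get(index) : "");
        }
        return out;
    }
}
```

The indexes are computed once per file from its header row and then reused for every data row, so files with extra or reordered columns still produce rows in the same output layout.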

XLSX2CSV.java (I'm not the creator of this file, I just applied my needs to it)

import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;

import javax.xml.parsers.ParserConfigurationException;

import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellAddress;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * A rudimentary XLSX -> CSV processor modeled on the
 * POI sample program XLS2CSVmra from the package
 * org.apache.poi.hssf.eventusermodel.examples.
 * As with the HSSF version, this tries to spot missing
 *  rows and cells, and output empty entries for them.
 * <p>
 * Data sheets are read using a SAX parser to keep the
 * memory footprint relatively small, so this should be
 * able to read enormous workbooks.  The styles table and
 * the shared-string table must be kept in memory.  The
 * standard POI styles table class is used, but a custom
 * (read-only) class is used for the shared string table
 * because the standard POI SharedStringsTable grows very
 * quickly with the number of unique strings.
 * <p>
 * For a more advanced implementation of SAX event parsing
 * of XLSX files, see {@link XSSFEventBasedExcelExtractor}
 * and {@link XSSFSheetXMLHandler}. Note that for many cases,
 * it may be possible to simply use those with a custom 
 * {@link SheetContentsHandler} and no SAX code needed of
 * your own!
 */
public class XLSX2CSV {
    /**
     * Uses the XSSF Event SAX helpers to do most of the work
     *  of parsing the Sheet XML, and outputs the contents
     *  as a (basic) CSV.
     */
    private class SheetToCSV implements SheetContentsHandler {
        private boolean firstCellOfRow;
        private int currentRow = -1;
        private int currentCol = -1;
        private int maxrows = 60000;



        private void outputMissingRows(int number) {

            for (int i=0; i<number; i++) {
                for (int j=0; j<minColumns; j++) {
                    output.append(',');
                }
                output.append('\n');
            }
        }

        @Override
        public void startRow(int rowNum) {
            // If there were gaps, output the missing rows
            outputMissingRows(rowNum-currentRow-1);
            // Prepare for this row
            firstCellOfRow = true;
            currentRow = rowNum;
            currentCol = -1;

            if (rowNum == maxrows) {
                throw new RuntimeException("Force stop at maxrows");
            }
        }

        @Override
        public void endRow(int rowNum) {
            // Ensure the minimum number of columns
            for (int i=currentCol; i<minColumns; i++) {
                output.append(',');
            }
            output.append('\n');
        }

        @Override
        public void cell(String cellReference, String formattedValue,
                XSSFComment comment) {
            if (firstCellOfRow) {
                firstCellOfRow = false;
            } else {
                output.append(',');
            }            

            // gracefully handle missing CellRef here in a similar way as XSSFCell does
            if(cellReference == null) {
                cellReference = new CellAddress(currentRow, currentCol).formatAsString();
            }

            // Did we miss any cells?
            int thisCol = (new CellReference(cellReference)).getCol();
            int missedCols = thisCol - currentCol - 1;
            for (int i=0; i<missedCols; i++) {
                output.append(',');
            }
            currentCol = thisCol;

            // Number or string?
            try {
                //noinspection ResultOfMethodCallIgnored
                Double.parseDouble(formattedValue);
                output.append(formattedValue);
            } catch (NumberFormatException e) {
                output.append('"');
                output.append(formattedValue);
                output.append('"');
            }
        }

        @Override
        public void headerFooter(String text, boolean isHeader, String tagName) {
            // Headers and footers are deliberately not written to the CSV.
        }
    }


    ///////////////////////////////////////

    private final OPCPackage xlsxPackage;

    /**
     * Number of columns to read starting with leftmost
     */
    private final int minColumns;

    /**
     * Destination for data
     */
    private final PrintStream output;

    /**
     * Creates a new XLSX -> CSV converter
     *
     * @param pkg        The XLSX package to process
     * @param output     The PrintStream to output the CSV to
     * @param minColumns The minimum number of columns to output, or -1 for no minimum
     */
    public XLSX2CSV(OPCPackage pkg, PrintStream output, int minColumns) {
        this.xlsxPackage = pkg;
        this.output = output;
        this.minColumns = minColumns;
    }

    /**
     * Parses and shows the content of one sheet
     * using the specified styles and shared-strings tables.
     *
     * @param styles The table of styles that may be referenced by cells in the sheet
     * @param strings The table of strings that may be referenced by cells in the sheet
     * @param sheetHandler The handler that receives the row and cell callbacks
     * @param sheetInputStream The stream to read the sheet-data from.
     *
     * @exception java.io.IOException An IO exception from the parser,
     *            possibly from a byte stream or character stream
     *            supplied by the application.
     * @throws SAXException if parsing the XML data fails.
     */
    public void processSheet(
            StylesTable styles,
            ReadOnlySharedStringsTable strings,
            SheetContentsHandler sheetHandler, 
            InputStream sheetInputStream) throws IOException, SAXException {
        DataFormatter formatter = new DataFormatter();
        InputSource sheetSource = new InputSource(sheetInputStream);
        try {
            XMLReader sheetParser = SAXHelper.newXMLReader();
            ContentHandler handler = new XSSFSheetXMLHandler(
                  styles, null, strings, sheetHandler, formatter, false);
            sheetParser.setContentHandler(handler);
            sheetParser.parse(sheetSource);
         } catch(ParserConfigurationException e) {
            throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
         }
    }

    /**
     * Initiates the processing of the XLSX workbook file to CSV.
     *
     * @throws IOException If reading the data from the package fails.
     * @throws OpenXML4JException if the package parts cannot be read.
     * @throws SAXException if parsing the XML data fails.
     */
    public void process() throws IOException, OpenXML4JException, SAXException {
        ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(this.xlsxPackage);
        XSSFReader xssfReader = new XSSFReader(this.xlsxPackage);
        StylesTable styles = xssfReader.getStylesTable();
        XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
        int index = 0;
        while (iter.hasNext()) {
            try (InputStream stream = iter.next()) {
                processSheet(styles, strings, new SheetToCSV(), stream);
            }
            ++index;
        }
    }
} 
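The cell() callback above writes formattedValue between quotes verbatim, so a line break inside a cell (such as the 'enter' before an email) ends up as a literal line break in the CSV, and an embedded quote is not escaped. A reader that splits the CSV line by line would then see a premature end of row, which is presumably the symptom described. A minimal sketch of a sanitizing helper follows; `CsvCell.quote` is a hypothetical name and not part of POI.

```java
class CsvCell {
    /**
     * Quotes a value for CSV output: every run of line breaks becomes a single space,
     * surrounding whitespace is trimmed, and embedded double quotes are doubled.
     */
    static String quote(String value) {
        if (value == null) {
            return "\"\"";
        }
        // \R matches any line-break sequence (\n, \r\n, \r, and others).
        String oneLine = value.replaceAll("\\R+", " ").trim();
        return "\"" + oneLine.replace("\"", "\"\"") + "\"";
    }
}
```

In the catch branch of cell(), the three append calls could then be replaced by a single `output.append(CsvCell.quote(formattedValue))`. Note that this changes the cell content (line breaks become spaces); if the breaks must be preserved, the CSV has to be read with a parser that understands quoted multi-line fields instead.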

I'm in search of different (and working) approaches to my objective.

Thank you for your time

abr asked Jul 20 '18 16:07



3 Answers

How about this:

// Open the zip, either as a ZipFile or as a ZipInputStream
ZipFile zipFile = new ZipFile(billWater, Charset.forName("gbk"));

ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(billWater), Charset.forName("gbk"));
// ZipEntry zipEntry;

// Use OpenCSV to parse one CSV entry of the zip into beans
public static <T> List<T> processCSVFileByZip(ZipFile zipFile, ZipEntry zipEntry, Class<? extends T> clazz, Charset charset) throws IOException {
    Reader in = new InputStreamReader(zipFile.getInputStream(zipEntry), charset);
    return processCSVFile(in, clazz, charset, ',');
}

public static <T> List<T> processCSVFile(Reader in, Class<? extends T> clazz, Charset charset, char sep) {
    // Skip the header line and map each remaining line onto a bean of type clazz
    CsvToBean<T> csvToBean = new CsvToBeanBuilder<T>(in)
            .withType(clazz).withSkipLines(1)
            .withIgnoreLeadingWhiteSpace(true).withSeparator(sep)
            .build();
    return csvToBean.parse();
}

// This seems to depend on the xlsx file format

Yy-- answered Oct 30 '22 06:10


Okay, so I've tried replicating your Excel file and I completely threw XLSX2CSV out the window. I don't think converting the xlsx into csv is the right approach because, depending on your XLSX format, it can read all the empty rows (you probably know that, since you've set a row counter of 60k). Not only that, but depending on the fields' content, it may or may not produce incorrect output with special characters, as in your problem.

What I've done is use this library, https://github.com/davidpelfree/sjxlsx, to read and re-write the file. It's pretty straightforward, and the newly generated xlsx file has the fields corrected.

I suggest you try this approach (maybe not with this lib) of re-writing the file in order to correct it.
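Whether a given sheet really consists mostly of empty rows can be checked without loading the workbook, because an .xlsx file is a zip archive whose worksheets are XML parts. The sketch below streams one worksheet part with the JDK's StAX parser and counts how many rows are declared versus how many contain at least one value. `SheetRowCounter` is a hypothetical name, and the part path (for example `xl/worksheets/sheet1.xml`) is an assumption that should be verified per file.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class SheetRowCounter {
    /** Opens the .xlsx as a zip and counts rows in the given worksheet part. */
    static int[] countRows(File xlsx, String sheetPart) throws IOException, XMLStreamException {
        try (ZipFile zip = new ZipFile(xlsx)) {
            ZipEntry entry = zip.getEntry(sheetPart);
            if (entry == null) {
                throw new IOException("No such part: " + sheetPart);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return countRows(in);
            }
        }
    }

    /** Returns {declaredRows, rowsWithValues} for worksheet XML, reading it as a stream. */
    static int[] countRows(InputStream sheetXml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(sheetXml);
        int declared = 0;
        int withValues = 0;
        boolean rowHasValue = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("row")) {
                    declared++;
                    rowHasValue = false;
                } else if (name.equals("v") || name.equals("is")) {
                    // <v> holds a cell value, <is> an inline string.
                    rowHasValue = true;
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("row") && rowHasValue) {
                withValues++;
            }
        }
        reader.close();
        return new int[] {declared, withValues};
    }
}
```

A large gap between the two counts would suggest that the file is worth re-writing (or that empty rows should be skipped) before any heavier processing.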

micael cunha answered Oct 30 '22 07:10


I think there are at least two open questions in here:

  1. Out of memory in WorkbookFactory.create() when opening old-style XLS files which are sparse

  2. XLSX2CSV is corrupting your new-style XLSX files, possibly due to "a certain character [incorrectly treated as] endRow"

For (1), I would say that you need to find a Java XLS library which either handles sparse files without allocating empty spaces or can process the file in a streaming manner, instead of the batch approach taken by WorkbookFactory.

For (2), you need to find a Java XLSX library which won't corrupt your data.

I don't know of any good Java libraries for (1) or (2), sorry.

However, I would like to suggest that you write this script in Excel rather than in Java. Excel has an excellent scripting language built in, Excel VBA, which can handle opening multiple files, extracting data from them, etc. Also, you can be confident that a script running in Excel VBA will not have the trouble with Excel features like sparse tables or XLSX parsing that you are encountering in Java.

(You might also like to take a step back and evaluate how long it might take to do this by hand, if it is a one-off job, compared to how long you will need to spend to script this task.)

Good luck!

Rich answered Oct 30 '22 08:10