
OpenXML Sax method for exporting 100K+ rows to Excel fast

I have been trying to improve the performance of the SAX method for writing to an .xlsx file. I know Excel has a limit of 1,048,576 rows, and I have only hit that limit a few times. In most cases, though, I write out about 125K to 250K rows (a large dataset). The code I have tried doesn't seem as fast as it could be because of how many times it writes to the file. I would hope there is some caching involved, but it still seems like there is far too much disk access in the way the code works now.

The code below is similar to Using a template with OpenXML and SAX because I write the file with ClosedXML first and then switch to SAX for the large content. Memory usage goes off the charts when trying to use ClosedXML for this many rows, which is why I am using SAX.

        int numCols = dt.Columns.Count;
        int rowCnt = 0;
        //for (curRec = 0; curRec < totalRecs; curRec++)
        foreach (DataRow row in dt.Rows)
        {
            Row xlr = new Row();

            //starting of new row.
            //writer.WriteStartElement(xlr);

            for (int col = 0; col < numCols; ++col)
            {
                Cell cell = new Cell();
                CellValue v = new CellValue(row[col].ToString());

                string objDataType = row[col].GetType().ToString();
                if (objDataType.Contains(TypeCode.Int32.ToString()) || objDataType.Contains(TypeCode.Int64.ToString()))
                {
                    cell.DataType = new EnumValue<CellValues>(CellValues.Number);
                    //cell.CellValue = new CellValue(row[col].ToString());
                    cell.Append(v);
                }
                else if (objDataType.Contains(TypeCode.Decimal.ToString()) || objDataType.Contains("Single"))
                {
                    cell.DataType = new EnumValue<CellValues>(CellValues.Number);
                    cell.Append(v);
                    //TODO: set the decimal qualifier - may be fixed elsewhere
                    cell.StyleIndex = 2;
                }
                else
                {
                    //Add text to a text cell
                    cell.DataType = new EnumValue<CellValues>(CellValues.String);
                    cell.Append(v);
                }

                if (colStyles != null && col < colStyles.Count)
                {
                    cell.StyleIndex = (UInt32Value)colStyles[col];
                }

                //writer.WriteElement(cell);
                xlr.Append(cell);
            }
            writer.WriteElement(xlr);
            //end row element
            //writer.WriteEndElement();
            ++rowCnt;
        }

This code is very close to examples I have seen out there, but the problem is it is still pretty slow. Changing from writing the individual cells to appending them to the row and writing the whole row seems to have improved the process by about 10% on 125K rows.

Has anyone found a way to improve the writer or set up a way to write fewer times? Are there methods that could speed up this process?

Has anyone tried to set up some form of caching to improve performance?

CaptainBli, asked Mar 20 '23


1 Answer

The general issue is that you shouldn't mix the DOM and SAX approaches. Once you mix them, performance is akin to just using DOM; the benefits of SAX only show up when you go all in. To answer your questions first:

Has anyone found a way to improve the writer or set up a way to write fewer times? Are there methods that could speed up this process?

Don't mix the SAX writer with DOM manipulations. This means you shouldn't touch the SDK class properties or methods at all: cell.Append() is out, and so are cell.DataType and cell.StyleIndex.

When you do SAX, you go all in. (That sounds slightly provocative...) For example:

for (int i = 1; i <= 50000; ++i)
{
    oxa = new List<OpenXmlAttribute>();
    // this is the row index
    oxa.Add(new OpenXmlAttribute("r", null, i.ToString()));

    oxw.WriteStartElement(new Row(), oxa);

    for (int j = 1; j <= 100; ++j)
    {
        oxa = new List<OpenXmlAttribute>();
        // this is the data type ("t"), with CellValues.String ("str")
        oxa.Add(new OpenXmlAttribute("t", null, "str"));

        // it's suggested you also have the cell reference, but
        // you'll have to calculate the correct cell reference yourself.
        // Here's an example:
        //oxa.Add(new OpenXmlAttribute("r", null, "A1"));

        oxw.WriteStartElement(new Cell(), oxa);

        oxw.WriteElement(new CellValue(string.Format("R{0}C{1}", i, j)));

        // this is for Cell
        oxw.WriteEndElement();
    }

    // this is for Row
    oxw.WriteEndElement();
}

where oxa is a List<OpenXmlAttribute> and oxw is the SAX writer class OpenXmlWriter. There are more details in my article.
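To make that concrete, here is a minimal sketch of how the writer could be set up and how the question's DataTable loop might look in pure SAX form, with the data type and style index emitted as attributes ("t" and "s") instead of DOM properties. The dt and colStyles variables are the ones from the question; the file path, the numeric-type check, and the assumption that the worksheet part is written from scratch are illustrative placeholders, and adapting this to a ClosedXML-generated template still needs the part handling described in the linked question.

using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// Sketch only: writes dt into the first worksheet part from scratch via SAX.
static void WriteRowsWithSax(string filePath, DataTable dt, IList<int> colStyles)
{
    using (SpreadsheetDocument doc = SpreadsheetDocument.Open(filePath, true))
    using (OpenXmlWriter oxw = OpenXmlWriter.Create(doc.WorkbookPart.WorksheetParts.First()))
    {
        oxw.WriteStartElement(new Worksheet());
        oxw.WriteStartElement(new SheetData());

        int rowIndex = 0;
        foreach (DataRow row in dt.Rows)
        {
            ++rowIndex;
            var rowAttrs = new List<OpenXmlAttribute>
            {
                new OpenXmlAttribute("r", null, rowIndex.ToString())   // row index
            };
            oxw.WriteStartElement(new Row(), rowAttrs);

            for (int col = 0; col < dt.Columns.Count; ++col)
            {
                object value = row[col];
                bool isNumeric = value is int || value is long || value is decimal
                                 || value is float || value is double;

                var cellAttrs = new List<OpenXmlAttribute>();
                // "t" is the cell data type: "n" for number, "str" for a string,
                // replacing cell.DataType from the DOM version.
                cellAttrs.Add(new OpenXmlAttribute("t", null, isNumeric ? "n" : "str"));
                // "s" is the style index, replacing cell.StyleIndex.
                if (colStyles != null && col < colStyles.Count)
                    cellAttrs.Add(new OpenXmlAttribute("s", null, colStyles[col].ToString()));

                oxw.WriteStartElement(new Cell(), cellAttrs);
                oxw.WriteElement(new CellValue(value.ToString()));
                oxw.WriteEndElement(); // Cell
            }

            oxw.WriteEndElement(); // Row
        }

        oxw.WriteEndElement(); // SheetData
        oxw.WriteEndElement(); // Worksheet
    }
}

Because nothing is accumulated in an in-memory DOM tree, memory use stays flat no matter how many rows you stream out.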

There's no real way to cache the SAX operations. They're like a series of printf statements. You can probably write a helper function that just does the WriteStartElement(), WriteElement() and WriteEndElement() calls as one chunk (to write a complete Cell, for example), as sketched below.
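As a rough illustration of that idea (assuming the same usings as the sketch above), here is one possible shape for such a helper. The name WriteCell and its parameters are hypothetical, not part of the SDK:

// Wraps the WriteStartElement / WriteElement / WriteEndElement trio for one cell.
static void WriteCell(OpenXmlWriter oxw, string value, string dataType, uint? styleIndex)
{
    var attrs = new List<OpenXmlAttribute>();
    if (dataType != null)
        attrs.Add(new OpenXmlAttribute("t", null, dataType));                    // e.g. "str" or "n"
    if (styleIndex.HasValue)
        attrs.Add(new OpenXmlAttribute("s", null, styleIndex.Value.ToString())); // style index

    oxw.WriteStartElement(new Cell(), attrs);
    oxw.WriteElement(new CellValue(value));
    oxw.WriteEndElement(); // Cell
}

The inner loop then collapses to a single call per cell, something like WriteCell(oxw, row[col].ToString(), "str", null).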

Vincent Tan, answered Apr 02 '23