I have a question about loading Excel (.xlsx) files in C#. I implemented Excel loading with NPOI 2.0, but the performance was quite bad: 15 to 25 seconds to load 10,000 rows by 60 columns (on Win7 with an Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz (4 CPUs), ~2.5GHz). I thought this was because NPOI 2.0 is still in beta, so I tried another library, EPPlus, but it takes about the same amount of time to load the file.
Here is how I load it with EPPlus:
    var existingFile = new FileInfo(path);
    var excelData = new ExcelViewModel(path);
    // Open and read the XLSX file.
    using (var package = new ExcelPackage(existingFile))
    {
        // Get the workbook in the file
        ExcelWorkbook workBook = package.Workbook;
        if (workBook != null)
        {
            // Here is some initializing......
            var viewSheetModel = new ExcelSheetViewModel(sheet.Name, numberOfColumns, titles);
            for (var row = titleRowIndex + 1; row <= end.Row; ++row)
            {
                var viewRowModel = new ExcelRowViewModel();
                for (int column = start.Column; column <= end.Column; ++column)
                {
                    var cell = sheet.Cells[row, column];
                    viewRowModel.AddCellValue(cell.Value != null ? cell.Value.ToString() : string.Empty);
                }
                viewSheetModel.Rows.Add(viewRowModel);
            }
            excelData.AddSheet(viewSheetModel);
        }
    }
According to the dotTrace profiler, about 40% of the time is spent in the get_Workbook method (called when accessing the "package.Workbook" property), another 30% in get_Item and get_Value calls, and 5% in AddCellValue (which belongs to my data model). The rest of the time is spread across various method calls.
Is there something I'm doing wrong, or is this performance normal?
Cheers
What I've found is that the FOR loops are very expensive. Here's how I got an 85,000 x 26 sheet loaded in just over 1 second.
    // Requires: using System.Linq;
    ExcelWorksheet ws = ...
    // Note: the "+ 1" extends the range one row and one column past the used
    // area, so the result ends with an extra empty row and column.
    Int32 maxLength = ws.Dimension.End.Row + 1;
    Int32 maxWidth = ws.Dimension.End.Column + 1;
    // Fetch the entire sheet as one huge range
    ExcelRange cells = ws.Cells[1, 1, maxLength, maxWidth];
    // cells.Value now contains a 2 dimensional object array
    // Feel free to stop here
    // I wanted a jagged array of type string, so I converted it.
    // Start by converting the 2D array to 1D.
    object[] obj_values = ((object[,]) cells.Value).Cast<object>().ToArray();
    // Convert object[] to string[] and convert nulls to String.Empty
    string[] str_values = Array.ConvertAll(obj_values, p => p == null ? "" : p.ToString());
    // Chunk 1D array back into a jagged array, maxWidth values per row
    Int32 j = 0;
    string[][] values = str_values.GroupBy(p => j++ / maxWidth).Select(q => q.ToArray()).ToArray();
    // This was very fast compared to FOR loops!
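For readers who only need the conversion step, here is a minimal sketch of the same object[,] to string[][] conversion written with two plain loops over the in-memory array, which avoids the side-effecting j++ counter inside GroupBy. The RangeConverter class and ToJagged method are hypothetical names made up for this illustration and are not part of EPPlus. The sample array stands in for ((object[,]) cells.Value), so no Excel file is needed to run it.

```csharp
using System;

// Hypothetical helper (not part of EPPlus): converts a 2D object array, such as
// the one a multi-cell ExcelRange.Value returns, into a jagged string array.
public static class RangeConverter
{
    public static string[][] ToJagged(object[,] values)
    {
        int rows = values.GetLength(0);
        int columns = values.GetLength(1);
        var result = new string[rows][];
        for (int r = 0; r < rows; ++r)
        {
            var line = new string[columns];
            for (int c = 0; c < columns; ++c)
            {
                // Empty cells come back as null; map them to an empty string.
                object value = values[r, c];
                line[c] = value == null ? string.Empty : value.ToString();
            }
            result[r] = line;
        }
        return result;
    }

    public static void Main()
    {
        // Stand-in data for ((object[,]) cells.Value).
        var sample = new object[,] { { "id", "price" }, { 1, null }, { 2, "n/a" } };
        string[][] rows = ToJagged(sample);
        Console.WriteLine(rows.Length);        // 3
        Console.WriteLine(rows[1][1].Length);  // 0 (null became an empty string)
    }
}
```

The loops here only touch an array that is already in memory, so they do not pay the per-cell lookup cost of going through sheet.Cells[row, column].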
It appears to me that, yes, the observed performance is normal for EPPlus. I'm encountering similar issues five years later with EPPlus 4.5.2.1. Profiling gives 59% in get_Worksheet, and a single-threaded spreadsheet read on an i5-4200U manages about 120,000 cells/second. While this is an improvement on the ~50,000 cells/second mentioned in the original post, it may well be down to hardware differences.
For comparison, SpreadsheetLight benchmarks 425,000 cells/second on what appears to be an i7-7700, which is about three times faster than I'm measuring for EPPlus. My homebrew, unoptimized parser written in C# reads around 430,000 cells/second retrieving the same data from a .csv file, and @Tim Andersen's SpreadsheetGear comment above normalizes to 400,000 cells/second. I've not yet located comparative benchmarks between EPPlus and other Excel libraries such as ClosedXML, NPOI, Aspose, or Microsoft's Open XML SDK.
Within EPPlus, the approaches I've profiled are, from fastest to slowest:
1. ExcelWorksheet.Cells[1, 1, dimension.Rows, dimension.Columns].Value (essentially @Kevin M's answer but without the off-by-one)
2. ExcelWorksheet.GetValue<string>(row, column)
3. ExcelWorksheet.GetValue(row, column)
4. ExcelWorksheet.Cells[row, column].Text
5. ExcelWorksheet.Cells[row, column].Value
As of EPPlus 4.5.2.1, obtaining the object[,] from ExcelRange.Value in the first approach is a few percent faster than the GetValue() overloads. Cell-by-cell access through Cells[row, column] is about 25% slower than GetValue().
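To make the two fastest approaches concrete, here is a rough sketch that times a bulk read of the used range against a cell-by-cell GetValue<string> loop. It assumes the EPPlus 4.x API and takes the .xlsx path as the first command-line argument. The ReadComparison class name is made up for illustration, and this is not the harness used for the figures quoted above. Later EPPlus versions may need extra setup (such as a license context) that is not shown here.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using OfficeOpenXml; // EPPlus NuGet package (4.x API assumed)

public static class ReadComparison
{
    public static void Main(string[] args)
    {
        // args[0]: path to an existing .xlsx file, supplied by the caller.
        using (var package = new ExcelPackage(new FileInfo(args[0])))
        {
            // Access Workbook once and keep the worksheet reference around.
            ExcelWorksheet ws = package.Workbook.Worksheets.First();
            ExcelAddressBase dim = ws.Dimension; // null when the sheet is empty
            if (dim == null) return;

            // Approach 1: a single bulk read of the used range.
            var timer = Stopwatch.StartNew();
            object bulk = ws.Cells[dim.Start.Row, dim.Start.Column,
                                   dim.End.Row, dim.End.Column].Value;
            // A one-cell range yields the cell value itself, not an array.
            object[,] values = bulk as object[,];
            int bulkCells = values == null ? 1 : values.Length;
            Console.WriteLine("Bulk read: {0} cells in {1} ms",
                bulkCells, timer.ElapsedMilliseconds);

            // Approach 2: cell by cell through GetValue<string>.
            timer.Restart();
            int cells = 0;
            long characters = 0; // consume the values so the reads are not dead code
            for (int row = dim.Start.Row; row <= dim.End.Row; ++row)
            {
                for (int column = dim.Start.Column; column <= dim.End.Column; ++column)
                {
                    string text = ws.GetValue<string>(row, column);
                    if (text != null) characters += text.Length;
                    ++cells;
                }
            }
            Console.WriteLine("GetValue<string>: {0} cells ({1} chars) in {2} ms",
                cells, characters, timer.ElapsedMilliseconds);
        }
    }
}
```

Timings from a single run like this are noisy. Repeating each approach several times and discarding the first pass would give steadier numbers.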
Review of the EPPlus sources suggests code changes within EPPlus would be needed for improvement. Workbook access remains expensive on all paths I've profiled, and it's single-threaded, preventing linear scaling from additional cores. There's also nontrivial overhead from cell address translation and hoistable calls into System.Globalization, which is consistent with other libraries being roughly three times faster than EPPlus.