Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# Excel Interop Slow when looping through cells

I am trying to extract all text data from an Excel document in C# and am having performance issues. In the following code I open the Workbook, loop over all worksheets, and loop over all cells in the used range, extracting the text from each cell as I go. The problem is, this takes 14 seconds to execute.

public class ExcelFile
{
    public string Path = @"C:\test.xlsx";
    private Excel.Application xl = new Excel.Application();
    private Excel.Workbook WB;
    public string FullText;
    private Excel.Range rng;
    private Dictionary<string, string> Variables;
    public ExcelFile()
    {
        WB = xl.Workbooks.Open(Path);
        xl.Visible = true;
        foreach (Excel.Worksheet CurrentWS in WB.Worksheets)
        {
            rng = CurrentWS.UsedRange;
            for (int i = 1; i < rng.Count; i++)
            { FullText += rng.Cells[i].Value; }
        }
        WB.Close(false);
        xl.Quit();
    }
}

Whereas in VBA I would do something like this, which takes ~1 second:

Sub run()
    Dim strText As String
    For Each ws In ActiveWorkbook.Sheets
        For Each c In ws.UsedRange
            strText = strText & c.Text
        Next c
    Next ws
End Sub

Or, even faster (less than 1 second):

Sub RunFast()
    Dim strText As String
    Dim varCells As Variant
    For Each ws In ActiveWorkbook.Sheets
        varCells = ws.UsedRange
        For i = 1 To UBound(varCells, 1)
            For j = 1 To UBound(varCells, 2)
                strText = strText & CStr(varCells(i, j))
            Next j
        Next i
    Next ws
End Sub

Perhaps something is happening in the for loop in C# that I'm not aware of? Is it possible to load a range into an array-type object (as in my last example) to allow iteration over just the values, not the cell objects?

like image 701
pwwolff Avatar asked Mar 04 '17 23:03

pwwolff


2 Answers

Excel and C# run in different environments completely. C# runs in the .NET framework using managed memory while Excel is a native C++ application and runs in unmanaged memory. Translating data between these two (a process called "marshaling") is extremely expensive in terms of performance.

Tweaking your code isn't going to help. For loops, string construction, etc. are all blazingly fast compared to the marshaling process. The only way you are going to get significantly better performance is to reduce the number of trips that have to cross the interprocess boundary. Extracting data cell by cell is never going to get you the performance you want.

Here are a couple options:

  1. Write a sub or function in VBA that does everything you want, then call that sub or function via interop. Walkthrough.

  2. Use interop to save the worksheet to a temporary file in CSV format, then open the file using C#. You will need to loop through and parse the file to get it into a useful data structure, but this loop will go much faster.

  3. Use interop to save a range of cells to the clipboard, then use C# to read the clipboard directly.

like image 194
John Wu Avatar answered Oct 14 '22 05:10

John Wu


I use this function. The loops are only for converting to array starting at index 0, the main work is done in object[,] tmp = range.Value.

public object[,] GetTable(int row, int col, int width, int height)
{
    object[,] arr = new object[height, width];

    Range c1 = (Range)Worksheet.Cells[row + 1, col + 1];
    Range c2 = (Range)Worksheet.Cells[row + height, col + width];
    Range range = Worksheet.get_Range(c1, c2);

    object[,] tmp = range.Value;

    for (int i = 0; i < height; ++i)
    {
        for (int j = 0; j < width; ++j)
        {
            arr[i, j] = tmp[i + tmp.GetLowerBound(0), j + tmp.GetLowerBound(1)];
        }
    }                 

    return arr;
}
like image 34
Antonín Lejsek Avatar answered Oct 14 '22 04:10

Antonín Lejsek