I am trying to extract all text data from an Excel document in C# and am having performance issues. In the following code I open the Workbook, loop over all worksheets, and loop over all cells in the used range, extracting the text from each cell as I go. The problem is, this takes 14 seconds to execute.
public class ExcelFile
{
public string Path = @"C:\test.xlsx";
private Excel.Application xl = new Excel.Application();
private Excel.Workbook WB;
public string FullText;
private Excel.Range rng;
private Dictionary<string, string> Variables;
public ExcelFile()
{
WB = xl.Workbooks.Open(Path);
xl.Visible = true;
foreach (Excel.Worksheet CurrentWS in WB.Worksheets)
{
rng = CurrentWS.UsedRange;
for (int i = 1; i < rng.Count; i++)
{ FullText += rng.Cells[i].Value; }
}
WB.Close(false);
xl.Quit();
}
}
Whereas in VBA I would do something like this, which takes ~1 second:
Sub run()
Dim strText As String
For Each ws In ActiveWorkbook.Sheets
For Each c In ws.UsedRange
strText = strText & c.Text
Next c
Next ws
End Sub
Or, even faster (less than 1 second):
Sub RunFast()
Dim strText As String
Dim varCells As Variant
For Each ws In ActiveWorkbook.Sheets
varCells = ws.UsedRange
For i = 1 To UBound(varCells, 1)
For j = 1 To UBound(varCells, 2)
strText = strText & CStr(varCells(i, j))
Next j
Next i
Next ws
End Sub
Perhaps something is happening in the for loop in C# that I'm not aware of? Is it possible to load a range into an array-type object (as in my last example) to allow iteration over just the values, not the cell objects?
Excel and C# run in different environments completely. C# runs in the .NET framework using managed memory while Excel is a native C++ application and runs in unmanaged memory. Translating data between these two (a process called "marshaling") is extremely expensive in terms of performance.
Tweaking your code isn't going to help. For loops, string construction, etc. are all blazingly fast compared to the marshaling process. The only way you are going to get significantly better performance is to reduce the number of trips that have to cross the interprocess boundary. Extracting data cell by cell is never going to get you the performance you want.
Here are a couple options:
Write a sub or function in VBA that does everything you want, then call that sub or function via interop. Walkthrough.
Use interop to save the worksheet to a temporary file in CSV format, then open the file using C#. You will need to loop through and parse the file to get it into a useful data structure, but this loop will go much faster.
Use interop to save a range of cells to the clipboard, then use C# to read the clipboard directly.
I use this function. The loops are only for converting to array starting at index 0, the main work is done in object[,] tmp = range.Value
.
public object[,] GetTable(int row, int col, int width, int height)
{
object[,] arr = new object[height, width];
Range c1 = (Range)Worksheet.Cells[row + 1, col + 1];
Range c2 = (Range)Worksheet.Cells[row + height, col + width];
Range range = Worksheet.get_Range(c1, c2);
object[,] tmp = range.Value;
for (int i = 0; i < height; ++i)
{
for (int j = 0; j < width; ++j)
{
arr[i, j] = tmp[i + tmp.GetLowerBound(0), j + tmp.GetLowerBound(1)];
}
}
return arr;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With