This is a very weird situation, first the code...
The code
private List<DispatchInvoiceCTNDataModel> WorksheetToDataTableForInvoiceCTN(ExcelWorksheet excelWorksheet, int month, int year)
{
int totalRows = excelWorksheet.Dimension.End.Row;
int totalCols = excelWorksheet.Dimension.End.Column;
DataTable dt = new DataTable(excelWorksheet.Name);
// for (int i = 1; i <= totalRows; i++)
Parallel.For(1, totalRows + 1, (i) =>
{
DataRow dr = null;
if (i > 1)
{
dr = dt.Rows.Add();
}
for (int j = 1; j <= totalCols; j++)
{
if (i == 1)
{
var colName = excelWorksheet.Cells[i, j].Value.ToString().Replace(" ", String.Empty);
lock (lockObject)
{
if (!dt.Columns.Contains(colName))
dt.Columns.Add(colName);
}
}
else
{
dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;
}
}
});
var excelDataModel = dt.ToList<DispatchInvoiceCTNDataModel>();
// now we have mapped everything expect for the IDs
excelDataModel = MapInvoiceCTNIDs(excelDataModel, month, year, excelWorksheet);
return excelDataModel;
}
The problem
When I am running the code on random occasion it would throw IndexOutOfRangeException
on the line
dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;
For some random value of i
and j
. When I step over the code (F10
), since it is running in a ParallelLoop, some other thread kicks and and other exception is throw, that other exception is something like (I could not reproduce it, it just came once, but I think it is also related to this threading issue) Column 31 not found in excelWorksheet
. I don't understand how could any of these exception occur?
case 1
The IndexOutOfRangeException
should not even occur, as the only code/shared variable dt
I have locked around accessing it, rest all is either local or parameter so there should not have any thread related issue. Also, if I check the value of i
or j
in debug window, or even evaluate this whole expression dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;
or a part of it in Debug window, then it works just fine, no errors of any sort or nothing.
case 2
For the second error, (which unfortunately is not reproducing now, but still) it should not have occurred as there are 33 columns in the excel.
More Code
In case someone might need how this method was called
using (var xlPackage = new ExcelPackage(viewModel.postedFile.InputStream))
{
ExcelWorksheets worksheets = xlPackage.Workbook.Worksheets;
// other stuff
var entities = this.WorksheetToDataTableForInvoiceCTN(worksheets[1], viewModel.Month, viewModel.Year);
// other stuff
}
Other
If someone needs more code/details let me know.
Update
Okay, to answer some comments. It is working fine when using for
loop, I have tested that many times. Also, there is no particular value of i
or j
for which the exception is thrown. Sometimes it is 8, 6
at other time it could be anything, say 19,2
or anything. Also, in the Parallel
loop the +1
is not doing any damage as the msdn documentation says it is exclusive not inclusive. Also, if that were the issue I would only be getting exception at the last index (the last value of i) but that's not the case.
UPDATE 2
The given answer to lock around the code
dr = dt.Rows.Add();
I have changed it to
lock(lockObject) {
dr = dt.Rows.Add();
}
It is not working. Now I am getting ArgumentOutOfRangeException
, still if I run this in debug window, it just works fine.
Update 3
Here is the full exception detail, after update 2 (I am getting this on the line that I mentioned in update 2)
System.ArgumentOutOfRangeException was unhandled by user code
HResult=-2146233086
Message=Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: index
Source=mscorlib
ParamName=index
StackTrace:
at System.ThrowHelper.ThrowArgumentOutOfRangeException()
at System.Collections.Generic.List`1.get_Item(Int32 index)
at System.Data.RecordManager.NewRecordBase()
at System.Data.DataTable.NewRecordFromArray(Object[] value)
at System.Data.DataRowCollection.Add(Object[] values)
at AdminEntity.BAL.Service.ExcelImportServices.<>c__DisplayClass2e.<WorksheetToDataTableForInvoiceCTN>b__2d(Int32 i) in C:\Projects\Manager\Admin\AdminEntity\AdminEntity.BAL\Service\ExcelImportServices.cs:line 578
at System.Threading.Tasks.Parallel.<>c__DisplayClassf`1.<ForWorker>b__c()
InnerException:
Okay. So there are a few problems with your existing code, most of which have been touched on by others:
Parallel.For(0, 10, (i) => { Console.WriteLine(i); });
, the first four threads (on a quad-core system) will be queued with i
values 0-3. But any one of those threads may start or finish executing before any other. So you may see 2 printed first, whereupon thread 4 will be queued. Then thread 1 might complete, and thread 5 will be queued. Then thread 4 might complete, even before threads 0 or 3 do. Etc., etc. TL;DR: You CANNOT assume an ordered output in parallel.??
, which evaluates whether the preceding value is null, and if so, assigns the subsequent value. For example, foo = bar ?? ""
is the equivalent of if (bar == null) { foo = ""; } else { foo = bar; }
.So right off the bat, your code should look more like this:
private void ReadIntoTable(ExcelWorksheet sheet)
{
DataTable dt = new DataTable(sheet.Name);
int height = sheet.Dimension.Rows;
int width = sheet.Dimension.Columns;
for (int j = 1; j <= width; j++)
{
string colText = (sheet.Cells[1, j].Value ?? "").ToString();
dt.Columns.Add(colText);
}
for (int i = 2; i <= height; i++)
{
dt.Rows.Add();
}
Parallel.For(1, height, (i) =>
{
var row = dt.Rows[i - 1];
for (int j = 0; j < width; j++)
{
string str = (sheet.Cells[i + 1, j + 1].Value ?? "").ToString();
row[j] = str;
}
});
// convert to your special Excel data model
// ...
}
Much better!
...but it still doesn't work!
Yep, it still fails with an IndexOutOfRange exception. However, since we took your original line dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;
and split it into a couple pieces, we can see exactly which part it fails on. And it fails on row[j] = str;
, where we actually write the text into the row.
Uh-oh.
MSDN: DataRow Class
Thread Safety
This type is safe for multithreaded read operations. You must synchronize any write operations.
*sigh*. Yeah. Who knows why DataRow uses static anything when assigning values, but there you have it; writing to DataRow isn't thread-safe. And sure enough, doing this...
private static object s_lockObject = "";
private void ReadIntoTable(ExcelWorksheet sheet)
{
// ...
lock (s_lockObject)
{
row[j] = str;
}
// ...
}
...magically makes it work. Granted, it completely destroys the parallelism, but it works.
Well, it almost completely destroys the parallelism. Anecdotal experimentation on an Excel file with 18 columns and 46319 rows shows that the Parallel.For() loop creates its DataTable in about 3.2s on average, whereas replacing Parallel.For() with for (int i = 1; i < height; i++)
takes about 3.5s. My guess is that, since the lock is only there for writing data, there is a very small benefit realized by writing data on one thread and processing text on the other(s).
Of course, if you can create your own DataTable replacement class, you can see a much larger speed boost. For example:
string[,] rows = new string[height, width];
Parallel.For(1, height, (i) =>
{
for (int j = 0; j < width; j++)
{
rows[i - 1, j] = (sheet.Cells[i + 1, j + 1].Value ?? "").ToString();
}
});
This executes in about 1.8s on average for the same Excel table mentioned above - about half the time of our barely-parallel DataTable. Replacing the Parallel.For() with the standard for() in this snippet makes it run in about 2.5s.
So you can see a significant performance boost from parallelism, but also from a custom data structure - although the viability of the latter will depend on your ability to easily convert the returned values to that Excel data model thing, whatever it is.
The line dr = dt.Rows.Add();
is not thread safe, you are corrupting the internal state of the array in the DataTable that hold the rows for the table.
At first glance changing it to
if (i > 1)
{
lock (lockObject)
{
dr = dt.Rows.Add();
}
}
should fix it, but that does not mean other thread safety problems are not there from excelWorksheet.Cells
being accessed from multiple threads. (If excelWorksheet
is this class and you are running a STA main thread (WinForms or WPF) COM should marshal the cross thread calls for you)
EDIT: New thory, the problem comes from the fact that you are setting up your schema inside the parallel loop and attempting to write to it at the same time. Pull out all of the i == 1
logic to before the loop and then start at i == 2
private List<DispatchInvoiceCTNDataModel> WorksheetToDataTableForInvoiceCTN(ExcelWorksheet excelWorksheet, int month, int year)
{
int totalRows = excelWorksheet.Dimension.End.Row;
int totalCols = excelWorksheet.Dimension.End.Column;
DataTable dt = new DataTable(excelWorksheet.Name);
//Build the schema before we loop in parallel.
for (int j = 1; j <= totalCols; j++)
{
var colName = excelWorksheet.Cells[1, j].Value.ToString().Replace(" ", String.Empty);
if (!dt.Columns.Contains(colName))
dt.Columns.Add(colName);
}
Parallel.For(2, totalRows + 1, (i) =>
{
DataRow dr = null;
lock(lockObject) {
dr = dt.Rows.Add();
}
for (int j = 1; j <= totalCols; j++)
{
dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;
}
});
var excelDataModel = dt.ToList<DispatchInvoiceCTNDataModel>();
// now we have mapped everything expect for the IDs
excelDataModel = MapInvoiceCTNIDs(excelDataModel, month, year, excelWorksheet);
return excelDataModel;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With