Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Index out of range exception in using ParallelFor loop

This is a very weird situation, first the code...

The code

 private List<DispatchInvoiceCTNDataModel> WorksheetToDataTableForInvoiceCTN(ExcelWorksheet excelWorksheet, int month, int year)
        {
            int totalRows = excelWorksheet.Dimension.End.Row;
            int totalCols = excelWorksheet.Dimension.End.Column;
            DataTable dt = new DataTable(excelWorksheet.Name);
            // for (int i = 1; i <= totalRows; i++)
            Parallel.For(1, totalRows + 1, (i) =>
            {
                DataRow dr = null;
                if (i > 1)
                {
                    dr = dt.Rows.Add();
                }
                for (int j = 1; j <= totalCols; j++)
                {
                    if (i == 1)
                    {
                        var colName = excelWorksheet.Cells[i, j].Value.ToString().Replace(" ", String.Empty);
                        lock (lockObject)
                        {
                            if (!dt.Columns.Contains(colName))
                                dt.Columns.Add(colName);
                        }
                    }
                    else
                    {
                        dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;
                    }
                }
            });
            var excelDataModel = dt.ToList<DispatchInvoiceCTNDataModel>();
            // now we have mapped everything expect for the IDs
            excelDataModel = MapInvoiceCTNIDs(excelDataModel, month, year, excelWorksheet);
            return excelDataModel;
        }

The problem
When I am running the code on random occasion it would throw IndexOutOfRangeException on the line

  dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;

For some random value of i and j. When I step over the code (F10), since it is running in a ParallelLoop, some other thread kicks and and other exception is throw, that other exception is something like (I could not reproduce it, it just came once, but I think it is also related to this threading issue) Column 31 not found in excelWorksheet. I don't understand how could any of these exception occur?

case 1
The IndexOutOfRangeException should not even occur, as the only code/shared variable dt I have locked around accessing it, rest all is either local or parameter so there should not have any thread related issue. Also, if I check the value of i or j in debug window, or even evaluate this whole expression dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null; or a part of it in Debug window, then it works just fine, no errors of any sort or nothing.

case 2
For the second error, (which unfortunately is not reproducing now, but still) it should not have occurred as there are 33 columns in the excel.

More Code
In case someone might need how this method was called

using (var xlPackage = new ExcelPackage(viewModel.postedFile.InputStream))
            {
                ExcelWorksheets worksheets = xlPackage.Workbook.Worksheets;

                // other stuff 
                var entities = this.WorksheetToDataTableForInvoiceCTN(worksheets[1], viewModel.Month, viewModel.Year);
                // other stuff 
            }

Other
If someone needs more code/details let me know.

Update
Okay, to answer some comments. It is working fine when using for loop, I have tested that many times. Also, there is no particular value of i or j for which the exception is thrown. Sometimes it is 8, 6 at other time it could be anything, say 19,2or anything. Also, in the Parallel loop the +1 is not doing any damage as the msdn documentation says it is exclusive not inclusive. Also, if that were the issue I would only be getting exception at the last index (the last value of i) but that's not the case.

UPDATE 2
The given answer to lock around the code

  dr = dt.Rows.Add();

I have changed it to

  lock(lockObject) {
      dr = dt.Rows.Add();
  }

It is not working. Now I am getting ArgumentOutOfRangeException, still if I run this in debug window, it just works fine.

Update 3
Here is the full exception detail, after update 2 (I am getting this on the line that I mentioned in update 2)

System.ArgumentOutOfRangeException was unhandled by user code
  HResult=-2146233086
  Message=Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: index
  Source=mscorlib
  ParamName=index
  StackTrace:
       at System.ThrowHelper.ThrowArgumentOutOfRangeException()
       at System.Collections.Generic.List`1.get_Item(Int32 index)
       at System.Data.RecordManager.NewRecordBase()
       at System.Data.DataTable.NewRecordFromArray(Object[] value)
       at System.Data.DataRowCollection.Add(Object[] values)
       at AdminEntity.BAL.Service.ExcelImportServices.<>c__DisplayClass2e.<WorksheetToDataTableForInvoiceCTN>b__2d(Int32 i) in C:\Projects\Manager\Admin\AdminEntity\AdminEntity.BAL\Service\ExcelImportServices.cs:line 578
       at System.Threading.Tasks.Parallel.<>c__DisplayClassf`1.<ForWorker>b__c()
  InnerException: 
like image 595
Razort4x Avatar asked Apr 09 '15 14:04

Razort4x


Video Answer


2 Answers

Okay. So there are a few problems with your existing code, most of which have been touched on by others:

  • Parallel threads are at the mercy of the OS scheduler; therefore, although threads are queued in-order, they may (and often do) complete execution out-of-order. For example, given Parallel.For(0, 10, (i) => { Console.WriteLine(i); });, the first four threads (on a quad-core system) will be queued with i values 0-3. But any one of those threads may start or finish executing before any other. So you may see 2 printed first, whereupon thread 4 will be queued. Then thread 1 might complete, and thread 5 will be queued. Then thread 4 might complete, even before threads 0 or 3 do. Etc., etc. TL;DR: You CANNOT assume an ordered output in parallel.
  • Given that, as @ScottChamberlain noted, it's a very bad idea to do column generation within your parallel loop - because you have no guarantee that the thread doing column generation will create all your columns before another thread starts assigning data in rows to those column indices. E.g. you could be assigning data to cell [0,4] before your table actually has a fifth column.
    • It's worth noting that this should really be broken out of the loop anyway, purely from a code cleanliness perspective. At the moment, you have two nested loops, each with special behavior on a single iteration; better to separate that setup logic into its own loop and leave the main loop to assign data and nothing else.
  • For the same reason, you should not be creating new rows in the table within your parallel loop - because you have no guarantee that the rows will be added to the table in their source order. Break that out too, and access rows within the loop based on their index.
  • Some have mentioned using DataRow.NewRow() before Rows.Add(). Technically, NewRow() is the right way to go about things, but the actual recommended access pattern is a bit different than is probably appropriate for a cell-by-cell function, particularly when parallelism is intended (see MSDN: DataTable.NewRow Method). The fact remains that adding a new, blank row to a DataTable with Rows.Add() and populating it afterwards functions properly.
  • You can clean up your string formatting with the null-coalescing operator ??, which evaluates whether the preceding value is null, and if so, assigns the subsequent value. For example, foo = bar ?? "" is the equivalent of if (bar == null) { foo = ""; } else { foo = bar; }.

So right off the bat, your code should look more like this:

private void ReadIntoTable(ExcelWorksheet sheet)
{
    DataTable dt = new DataTable(sheet.Name);
    int height = sheet.Dimension.Rows;
    int width = sheet.Dimension.Columns;

    for (int j = 1; j <= width; j++)
    {
        string colText = (sheet.Cells[1, j].Value ?? "").ToString();
        dt.Columns.Add(colText);
    }
    for (int i = 2; i <= height; i++)
    {
        dt.Rows.Add();
    }

    Parallel.For(1, height, (i) =>
    {
        var row = dt.Rows[i - 1];
        for (int j = 0; j < width; j++)
        {
            string str = (sheet.Cells[i + 1, j + 1].Value ?? "").ToString();
            row[j] = str;
        }
    });

    // convert to your special Excel data model
    // ...
}

Much better!

...but it still doesn't work!

Yep, it still fails with an IndexOutOfRange exception. However, since we took your original line dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null; and split it into a couple pieces, we can see exactly which part it fails on. And it fails on row[j] = str;, where we actually write the text into the row.

Uh-oh.

MSDN: DataRow Class

Thread Safety

This type is safe for multithreaded read operations. You must synchronize any write operations.

*sigh*. Yeah. Who knows why DataRow uses static anything when assigning values, but there you have it; writing to DataRow isn't thread-safe. And sure enough, doing this...

private static object s_lockObject = "";

private void ReadIntoTable(ExcelWorksheet sheet)
{
    // ...
    lock (s_lockObject)
    {
        row[j] = str;
    }
    // ...
}

...magically makes it work. Granted, it completely destroys the parallelism, but it works.

Well, it almost completely destroys the parallelism. Anecdotal experimentation on an Excel file with 18 columns and 46319 rows shows that the Parallel.For() loop creates its DataTable in about 3.2s on average, whereas replacing Parallel.For() with for (int i = 1; i < height; i++) takes about 3.5s. My guess is that, since the lock is only there for writing data, there is a very small benefit realized by writing data on one thread and processing text on the other(s).

Of course, if you can create your own DataTable replacement class, you can see a much larger speed boost. For example:

string[,] rows = new string[height, width];
Parallel.For(1, height, (i) =>
{
    for (int j = 0; j < width; j++)
    {
        rows[i - 1, j] = (sheet.Cells[i + 1, j + 1].Value ?? "").ToString();
    }
});

This executes in about 1.8s on average for the same Excel table mentioned above - about half the time of our barely-parallel DataTable. Replacing the Parallel.For() with the standard for() in this snippet makes it run in about 2.5s.

So you can see a significant performance boost from parallelism, but also from a custom data structure - although the viability of the latter will depend on your ability to easily convert the returned values to that Excel data model thing, whatever it is.

like image 155
Diosjenin Avatar answered Oct 18 '22 22:10

Diosjenin


The line dr = dt.Rows.Add(); is not thread safe, you are corrupting the internal state of the array in the DataTable that hold the rows for the table.

At first glance changing it to

if (i > 1)
{
    lock (lockObject)
    {
        dr = dt.Rows.Add();
    }
}

should fix it, but that does not mean other thread safety problems are not there from excelWorksheet.Cells being accessed from multiple threads. (If excelWorksheet is this class and you are running a STA main thread (WinForms or WPF) COM should marshal the cross thread calls for you)


EDIT: New thory, the problem comes from the fact that you are setting up your schema inside the parallel loop and attempting to write to it at the same time. Pull out all of the i == 1 logic to before the loop and then start at i == 2

private List<DispatchInvoiceCTNDataModel> WorksheetToDataTableForInvoiceCTN(ExcelWorksheet excelWorksheet, int month, int year)
{
    int totalRows = excelWorksheet.Dimension.End.Row;
    int totalCols = excelWorksheet.Dimension.End.Column;
    DataTable dt = new DataTable(excelWorksheet.Name);

    //Build the schema before we loop in parallel.
    for (int j = 1; j <= totalCols; j++)
    {
        var colName = excelWorksheet.Cells[1, j].Value.ToString().Replace(" ", String.Empty);
        if (!dt.Columns.Contains(colName))
            dt.Columns.Add(colName);
    }

    Parallel.For(2, totalRows + 1, (i) =>
    {
        DataRow dr = null;
        lock(lockObject) {
            dr = dt.Rows.Add();
        }
        for (int j = 1; j <= totalCols; j++)
        {
            dr[j - 1] = excelWorksheet.Cells[i, j].Value != null ? excelWorksheet.Cells[i, j].Value.ToString() : null;
        }
    });
    var excelDataModel = dt.ToList<DispatchInvoiceCTNDataModel>();
    // now we have mapped everything expect for the IDs
    excelDataModel = MapInvoiceCTNIDs(excelDataModel, month, year, excelWorksheet);
    return excelDataModel;
}
like image 44
Scott Chamberlain Avatar answered Oct 18 '22 23:10

Scott Chamberlain