
Faster data checking and updating inside foreach loop

Tags: c#, linq

I'm reading data from a StreamReader line by line inside the following while loop.

while (!sr.EndOfStream)
{
   string[] rows = sr.ReadLine().Split(sep);

   int incr = 0;
   foreach (var item in rows)
   {
       if (item == "NA" | item == "" | item == "NULL" | string.IsNullOrEmpty(item) | string.IsNullOrWhiteSpace(item))
       {
           rows[incr] = null;
       }
       ++incr;
   }
    // another logic ...
}

The code works, but it is very slow because the CSV files are huge (500,000,000 rows and hundreds of columns). Is there a faster way to check the data, so that values such as "NA" and "" are replaced by null? Currently I use foreach with an incr counter variable to update each item.

I was wondering whether LINQ or a lambda would be faster, but I'm very new to these areas.

asked Dec 29 '17 08:12 by mateskabe



1 Answer

First, don't use foreach when you are modifying the collection. It's not a good habit, especially when you already maintain a counter variable.

A plain for loop is enough here, and the loop could also be made multi-threaded using Parallel.For. Both versions are shown below.

Code using normal for:

while (!sr.EndOfStream)
{
    string[] rows = sr.ReadLine().Split(sep);

    for (int i = 0; i < rows.Length; i++)
    {
        // I simplified your checks; this is safer and simpler.
        // IsNullOrWhiteSpace already covers null, "", and whitespace-only strings.
        if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
        {
            rows[i] = null;
        }
    }
    // another logic ...
}
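For reference, here is a minimal, self-contained sketch of the same loop that can be run as-is. The RowCleaner class, the NullOutMissing helper, the comma separator, and the in-memory sample data are assumptions made for illustration; they are not part of the original code.

```csharp
using System;
using System.IO;

public static class RowCleaner
{
    // Hypothetical helper: replaces blank, "NA", and "NULL" fields with null, in place.
    public static void NullOutMissing(string[] fields)
    {
        for (int i = 0; i < fields.Length; i++)
        {
            string field = fields[i];
            if (string.IsNullOrWhiteSpace(field) || field == "NA" || field == "NULL")
            {
                fields[i] = null;
            }
        }
    }

    public static void Main()
    {
        // In-memory sample standing in for the real CSV file (assumption).
        using (var sr = new StringReader("1,NA,foo\n,NULL, \n"))
        {
            string line;
            // ReadLine returns null once the input is exhausted.
            while ((line = sr.ReadLine()) != null)
            {
                string[] rows = line.Split(',');
                NullOutMissing(rows);

                // Print each field, showing nulled fields explicitly.
                Console.WriteLine(string.Join("|", Array.ConvertAll(rows, r => r ?? "<null>")));
            }
        }
    }
}
```

Keeping the check in one small method means the rest of the per-line logic can stay in the while loop unchanged.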

Code using Parallel.For (requires using System.Threading.Tasks):

while (!sr.EndOfStream)
{
    string[] rows = sr.ReadLine().Split(sep);

    Parallel.For(0, rows.Length, i =>
    {
        if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
        {
            rows[i] = null;
        }
    });
    // another logic ...
}

EDIT

We could also approach this from the other direction and parallelize across lines instead of across columns. I don't recommend it, though: it has to read the entire file into memory, which requires a LOT of RAM.

string[] lines = File.ReadAllLines("test.txt");
Parallel.For(0, lines.Length, x =>
{
    string[] rows = lines[x].Split(sep);

    for (int i = 0; i < rows.Length; i++)
    {
        if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
        {
            rows[i] = null;
        }
    }
});

But I don't think this is worth it; you decide. These kinds of operations don't play well with parallelization: each check takes so little time to compute that the threading overhead outweighs the gain.
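To check the overhead claim on your own machine, a rough timing sketch like the following may help. The column count, repetition count, sample values, and the OverheadDemo and IsMissing names are all assumptions for illustration, and Stopwatch timings like these are only indicative, not a rigorous benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class OverheadDemo
{
    // Same check as in the loops above, extracted for reuse.
    public static bool IsMissing(string s)
    {
        return string.IsNullOrWhiteSpace(s) || s == "NA" || s == "NULL";
    }

    public static void Main()
    {
        const int columns = 300;       // assumed: "hundreds of columns"
        const int repetitions = 20000; // assumed: number of simulated lines

        // Template row: every tenth field is "NA", the rest are ordinary values.
        var template = new string[columns];
        for (int i = 0; i < columns; i++)
        {
            template[i] = (i % 10 == 0) ? "NA" : "value" + i;
        }

        var rows = new string[columns];

        // Sequential version: plain for loop over the fields of each row.
        var sw = Stopwatch.StartNew();
        for (int r = 0; r < repetitions; r++)
        {
            Array.Copy(template, rows, columns);
            for (int i = 0; i < rows.Length; i++)
            {
                if (IsMissing(rows[i])) { rows[i] = null; }
            }
        }
        sw.Stop();
        Console.WriteLine("for:          " + sw.ElapsedMilliseconds + " ms");

        // Parallel version: one Parallel.For call per row, as in the answer.
        sw.Restart();
        for (int r = 0; r < repetitions; r++)
        {
            Array.Copy(template, rows, columns);
            Parallel.For(0, rows.Length, i =>
            {
                if (IsMissing(rows[i])) { rows[i] = null; }
            });
        }
        sw.Stop();
        Console.WriteLine("Parallel.For: " + sw.ElapsedMilliseconds + " ms");
    }
}
```

If the sequential timing is not clearly worse than the parallel one, the bottleneck is more likely the file reading and Split calls than the checks themselves.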

answered Sep 23 '22 12:09 by rokkerboci