I'm reading data from a StreamReader line by line inside the following while loop.
while (!sr.EndOfStream)
{
    string[] rows = sr.ReadLine().Split(sep);
    int incr = 0;
    foreach (var item in rows)
    {
        if (item == "NA" | item == "" | item == "NULL" | string.IsNullOrEmpty(item) | string.IsNullOrWhiteSpace(item))
        {
            rows[incr] = null;
        }
        ++incr;
    }
    // another logic ...
}
The code works fine, but it is very slow because the CSV files are huge (500,000,000 rows and hundreds of columns). Is there any faster way to check the data (if a value is "NA", "", ... it should be replaced by null)? Currently I'm using foreach with an incr variable so I can update the item inside the foreach.
I was wondering whether LINQ or a lambda would be faster, but I'm very new to these areas.
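For example, I imagine a LINQ version (with using System.Linq) would look something like the snippet below, but I don't know whether it would actually be faster, since it allocates an extra array for every line:

while (!sr.EndOfStream)
{
    string[] rows = sr.ReadLine()
        .Split(sep)
        .Select(item => (item == "NA" || item == "" || item == "NULL" || string.IsNullOrWhiteSpace(item))
            ? null
            : item)
        .ToArray();
    // another logic ...
}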
A for loop can be faster than a foreach loop when the array only needs to be accessed once per iteration. In one micro-benchmark, foreach was roughly 6 times slower than a plain for loop; a for loop without length caching ran about 3 times slower on lists than on arrays, and with length caching about 2 times slower. The foreach loop iterates over the items of a collection, such as an array or a list, and is mainly an easier, more readable alternative to the for loop.
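Those numbers depend heavily on the runtime, JIT and hardware. If you want to verify them yourself, a minimal sketch along these lines (array size and contents are arbitrary) can be timed with Stopwatch:

using System;
using System.Diagnostics;

class LoopBenchmark
{
    static void Main()
    {
        // Arbitrary test data: every 7th value is "NA", the rest are "x".
        string[] data = new string[10_000_000];
        for (int i = 0; i < data.Length; i++)
            data[i] = (i % 7 == 0) ? "NA" : "x";

        // Time a plain for loop over the array.
        var sw = Stopwatch.StartNew();
        int hitsFor = 0;
        for (int i = 0; i < data.Length; i++)
            if (data[i] == "NA") hitsFor++;
        sw.Stop();
        Console.WriteLine($"for:     {sw.ElapsedMilliseconds} ms ({hitsFor} hits)");

        // Time a foreach loop over the same array.
        sw.Restart();
        int hitsForeach = 0;
        foreach (var item in data)
            if (item == "NA") hitsForeach++;
        sw.Stop();
        Console.WriteLine($"foreach: {sw.ElapsedMilliseconds} ms ({hitsForeach} hits)");
    }
}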
Firstly, don't use foreach when changing the collection; it's not a good habit, especially when you already use a counter variable.
This loop could also be made multi-threaded using Parallel.For, this way:
Code using a normal for loop:
while (!sr.EndOfStream)
{
    string[] rows = sr.ReadLine().Split(sep);
    for (int i = 0; i < rows.Length; i++)
    {
        // I simplified your checks; this is safer and simpler.
        if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
        {
            rows[i] = null;
        }
    }
    // another logic ...
}
Code using Parallel.For:
while (!sr.EndOfStream)
{
    string[] rows = sr.ReadLine().Split(sep);
    Parallel.For(0, rows.Length, i =>
    {
        if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
        {
            rows[i] = null;
        }
    });
    // another logic ...
}
EDIT
We could approach this from another side, but I don't recommend it, because it requires a LOT of RAM: it has to read the entire file into memory.
string[] lines = File.ReadAllLines("test.txt");
Parallel.For(0, lines.Length, x =>
{
    string[] rows = lines[x].Split(sep);
    for (int i = 0; i < rows.Length; i++)
    {
        if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
        {
            rows[i] = null;
        }
    }
});
But I don't think this is worth it; you decide. These kinds of operations don't play well with parallelization, because the per-item check takes so little time to compute that the threading overhead outweighs the gain.
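If you do want to try parallelizing per line without loading the whole file, a possible middle ground (just a sketch, assuming sep is defined as above and that the "another logic" part can run independently for each line) is File.ReadLines, which streams lines lazily, combined with Parallel.ForEach:

// Sketch only: streams the file lazily instead of loading it all into memory,
// then splits and cleans each line on the thread pool.
Parallel.ForEach(File.ReadLines("test.txt"), line =>
{
    string[] rows = line.Split(sep);
    for (int i = 0; i < rows.Length; i++)
    {
        if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
        {
            rows[i] = null;
        }
    }
    // another logic ...
});

Whether this helps still depends on how expensive your per-line processing is; if it is only the null replacement, the sequential for loop above will likely be just as fast.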