We have a short method that parses a .csv file into a lookup:
ILookup<string, DgvItems> ParseCsv( string fileName )
{
    var file = File.ReadAllLines( fileName );
    return file.Skip( 1 ).Select( line => new DgvItems( line ) ).ToLookup( item => item.StocksID );
}
And the definition of DgvItems:
public class DgvItems
{
    public string DealDate { get; }
    public string StocksID { get; }
    public string StockName { get; }
    public string SecBrokerID { get; }
    public string SecBrokerName { get; }
    public double Price { get; }
    public int BuyQty { get; }
    public int CellQty { get; }

    public DgvItems( string line )
    {
        var split = line.Split( ',' );
        DealDate = split[0];
        StocksID = split[1];
        StockName = split[2];
        SecBrokerID = split[3];
        SecBrokerName = split[4];
        Price = double.Parse( split[5] );
        BuyQty = int.Parse( split[6] );
        CellQty = int.Parse( split[7] );
    }
}
We found that if we add an extra ToArray() before ToLookup(), like this:
static ILookup<string, DgvItems> ParseCsv( string fileName )
{
    var file = File.ReadAllLines( fileName );
    return file.Skip( 1 ).Select( line => new DgvItems( line ) ).ToArray().ToLookup( item => item.StocksID );
}
The latter is significantly faster. Specifically, with a test file of 1.4 million lines, the former takes around 4.3 seconds and the latter around 3 seconds.
I expected ToArray() to take extra time, so the latter should be slightly slower. Why is it actually faster?
Extra information:
We found this issue because another method parses the same .csv file into a different format in around 3 seconds, so we think this one should be able to do the same thing in 3 seconds.
The original data type was Dictionary<string, List<DgvItems>>, and the original code didn't use LINQ; the result was similar.
BenchmarkDotNet test class:
public class TestClass
{
    private readonly string[] Lines;

    public TestClass()
    {
        Lines = File.ReadAllLines( @"D:\20110315_Random.csv" );
    }

    [Benchmark]
    public ILookup<string, DgvItems> First()
    {
        return Lines.Skip( 1 ).Select( line => new DgvItems( line ) ).ToArray().ToLookup( item => item.StocksID );
    }

    [Benchmark]
    public ILookup<string, DgvItems> Second()
    {
        return Lines.Skip( 1 ).Select( line => new DgvItems( line ) ).ToLookup( item => item.StocksID );
    }
}
Result:
| Method | Mean | Error | StdDev |
|------- |--------:|---------:|---------:|
| First | 2.530 s | 0.0190 s | 0.0178 s |
| Second | 3.620 s | 0.0217 s | 0.0203 s |
I did another test based on the original code. It seems that the problem is not in LINQ.
public class TestClass
{
    private readonly string[] Lines;

    public TestClass()
    {
        Lines = File.ReadAllLines( @"D:\20110315_Random.csv" );
    }

    [Benchmark]
    public Dictionary<string, List<DgvItems>> First()
    {
        // Parse every line into a list first, then group the items.
        List<DgvItems> itemList = new List<DgvItems>();
        for ( int i = 1; i < Lines.Length; i++ )
        {
            itemList.Add( new DgvItems( Lines[i] ) );
        }
        Dictionary<string, List<DgvItems>> dictionary = new Dictionary<string, List<DgvItems>>();
        foreach( var item in itemList )
        {
            if( dictionary.TryGetValue( item.StocksID, out var list ) )
            {
                list.Add( item );
            }
            else
            {
                dictionary.Add( item.StocksID, new List<DgvItems>() { item } );
            }
        }
        return dictionary;
    }

    [Benchmark]
    public Dictionary<string, List<DgvItems>> Second()
    {
        // Parse and group in a single pass.
        Dictionary<string, List<DgvItems>> dictionary = new Dictionary<string, List<DgvItems>>();
        for ( int i = 1; i < Lines.Length; i++ )
        {
            var item = new DgvItems( Lines[i] );
            if ( dictionary.TryGetValue( item.StocksID, out var list ) )
            {
                list.Add( item );
            }
            else
            {
                dictionary.Add( item.StocksID, new List<DgvItems>() { item } );
            }
        }
        return dictionary;
    }
}
Result:
| Method | Mean | Error | StdDev |
|------- |--------:|---------:|---------:|
| First | 2.470 s | 0.0218 s | 0.0182 s |
| Second | 3.481 s | 0.0260 s | 0.0231 s |
I managed to replicate the issue with the simplified code below:
var lookup = Enumerable.Range(0, 2_000_000)
    .Select(i => ( (i % 1000).ToString(), i.ToString() ))
    .ToArray() // +20% speed boost
    .ToLookup(x => x.Item1);
It is important that the members of the created tuple are strings. Removing the two .ToString() calls from the above code eliminates the advantage of ToArray. The .NET Framework behaves a bit differently from .NET Core: there it is enough to remove only the first .ToString() to eliminate the observed difference.
I have no idea why this happens.
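For anyone who wants to try the simplified repro without setting up BenchmarkDotNet, here is a rough sketch that times both variants with Stopwatch. The class name ToArrayRepro and the helper BuildLookup are made up for illustration, and a single Stopwatch run includes JIT and GC noise, so treat the numbers as indicative only.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class ToArrayRepro
{
    // Hypothetical helper: builds the same lookup as the simplified snippet,
    // optionally materializing the sequence with ToArray() first.
    public static ILookup<string, (string, string)> BuildLookup(int count, bool materialize)
    {
        var items = Enumerable.Range(0, count)
            .Select(i => ((i % 1000).ToString(), i.ToString()));

        return materialize
            ? items.ToArray().ToLookup(x => x.Item1)
            : items.ToLookup(x => x.Item1);
    }

    public static void Main()
    {
        // Time the variant with ToArray() and the variant without it.
        foreach (var materialize in new[] { true, false })
        {
            var sw = Stopwatch.StartNew();
            var lookup = BuildLookup(2_000_000, materialize);
            sw.Stop();
            Console.WriteLine($"ToArray={materialize}: {sw.ElapsedMilliseconds} ms, {lookup.Count} keys");
        }
    }
}
```

Both variants should produce a lookup with the same keys and groups; only the elapsed time is expected to differ. Running each variant several times, or in a separate process, gives a fairer comparison.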