Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is it faster if I put an extra ToArray before ToLookup?

Tags:

c#

linq

We have a short method that parse .csv file to a lookup:

ILookup<string, DgvItems> ParseCsv( string fileName )
{
    var file = File.ReadAllLines( fileName );
    return file.Skip( 1 ).Select( line => new DgvItems( line ) ).ToLookup( item => item.StocksID );
}

And the definition of DgvItems:

public class DgvItems
{
    public string DealDate { get; }

    public string StocksID { get; }

    public string StockName { get; }

    public string SecBrokerID { get; }

    public string SecBrokerName { get; }

    public double Price { get; }

    public int BuyQty { get; }

    public int CellQty { get; }

    public DgvItems( string line )
    {
        var split = line.Split( ',' );
        DealDate = split[0];
        StocksID = split[1];
        StockName = split[2];
        SecBrokerID = split[3];
        SecBrokerName = split[4];
        Price = double.Parse( split[5] );
        BuyQty = int.Parse( split[6] );
        CellQty = int.Parse( split[7] );
    }
}

And we found that if we add an extra ToArray() before ToLookup() like this:

static ILookup<string, DgvItems> ParseCsv( string fileName )
{
    var file = File.ReadAllLines( fileName  );
    return file.Skip( 1 ).Select( line => new DgvItems( line ) ).ToArray().ToLookup( item => item.StocksID );
}

The latter is significantly faster. More specifically, when use test file with 1.4 million lines, the former takes around 4.3 seconds and the latter takes around 3 seconds.

I expect ToArray() should take extra time so the latter should be slightly slower. Why is it actually faster?


Extra information:

  1. We found this issue because there is another method that parse same .csv file to different format and it takes around 3 seconds so we think this one should be able to do the same thing in 3 seconds.

  2. The original data type is Dictionary<string, List<DgvItems>> and the original code didn't use linq and the result is similar.


BenchmarkDotNet test class:

public class TestClass
{
    private readonly string[] Lines;

    public TestClass()
    {
        Lines = File.ReadAllLines( @"D:\20110315_Random.csv" );
    }

    [Benchmark]
    public ILookup<string, DgvItems> First()
    {
        return Lines.Skip( 1 ).Select( line => new DgvItems( line ) ).ToArray().ToLookup( item => item.StocksID );
    }

    [Benchmark]
    public ILookup<string, DgvItems> Second()
    {
        return Lines.Skip( 1 ).Select( line => new DgvItems( line ) ).ToLookup( item => item.StocksID );
    }
}

Result:

| Method |    Mean |    Error |   StdDev |
|------- |--------:|---------:|---------:|
|  First | 2.530 s | 0.0190 s | 0.0178 s |
| Second | 3.620 s | 0.0217 s | 0.0203 s |

I did another test base on original code. Seems that the problem is not on Linq.

public class TestClass
{
    private readonly string[] Lines;

    public TestClass()
    {
        Lines = File.ReadAllLines( @"D:\20110315_Random.csv" );
    }

    [Benchmark]
    public Dictionary<string, List<DgvItems>> First()
    {
        List<DgvItems> itemList = new List<DgvItems>();
        for ( int i = 1; i < Lines.Length; i++ )
        {
            itemList.Add( new DgvItems( Lines[i] ) );
        }

        Dictionary<string, List<DgvItems>> dictionary = new Dictionary<string, List<DgvItems>>();

        foreach( var item in itemList )
        {
            if( dictionary.TryGetValue( item.StocksID, out var list ) )
            {
                list.Add( item );
            }
            else
            {
                dictionary.Add( item.StocksID, new List<DgvItems>() { item } );
            }
        }

        return dictionary;
    }

    [Benchmark]
    public Dictionary<string, List<DgvItems>> Second()
    {
        Dictionary<string, List<DgvItems>> dictionary = new Dictionary<string, List<DgvItems>>();
        for ( int i = 1; i < Lines.Length; i++ )
        {
            var item = new DgvItems( Lines[i] );

            if ( dictionary.TryGetValue( item.StocksID, out var list ) )
            {
                list.Add( item );
            }
            else
            {
                dictionary.Add( item.StocksID, new List<DgvItems>() { item } );
            }
        }

        return dictionary;
    }
}

Result:

| Method |    Mean |    Error |   StdDev |
|------- |--------:|---------:|---------:|
|  First | 2.470 s | 0.0218 s | 0.0182 s |
| Second | 3.481 s | 0.0260 s | 0.0231 s |
like image 926
Leisen Chang Avatar asked Nov 21 '19 06:11

Leisen Chang


1 Answers

I managed to replicate the issue with the simplified code below:

var lookup = Enumerable.Range(0, 2_000_000)
    .Select(i => ( (i % 1000).ToString(), i.ToString() ))
    .ToArray() // +20% speed boost
    .ToLookup(x => x.Item1);

It is important that the members of the created tuple are strings. Removing the two .ToString() from the above code eliminates the advantage of ToArray. The .NET Framework behaves a bit different than .NET Core, since it's enough to remove only the first .ToString() to eliminate the observed difference.

I have no idea why this happens.

like image 197
Theodor Zoulias Avatar answered Nov 03 '22 00:11

Theodor Zoulias