
How does File.ReadLines(file).Skip(numLines) work?

Tags:

c#

I have a rather large file that I wish to read from a particular line. I found

File.ReadLines(file).Skip(numLines);

which works great. However, I do not understand how it works under the hood. I wrote a few basic benchmarks to compare it against the approaches some colleagues had suggested. The methods I tested were:

  1. StreamReader used to read through all lines up to that point:

    public string streamToLine(int lineNumber)
    {
        StreamReader reader = new StreamReader(fileName);
    
        for (int i = 0; i < lineNumber - 1; i++)
        {
            reader.ReadLine();
        }
    
        string line = reader.ReadLine();
        reader.Close();
    
        return line;
    }
    
  2. File.ReadLines(file) and iterating to the line with an enumerator:

    public string readToLine(int lineNumber)
    {
        IEnumerator<string> lines = File.ReadLines(fileName).GetEnumerator();
        for (int i = 0; i < lineNumber; i++)
        {
            lines.MoveNext();
        }
        return lines.ToString();
    }
    
  3. Using the Skip functionality:

    public string skipToLine(int lineNumber)
    {
        IEnumerator<string> lines = File.ReadLines(fileName).Skip(lineNumber-1).GetEnumerator();
    
        return lines.ToString();
    }
    

I ran each test 10 times against a file with 10 million lines, attempting to read the 9 millionth line, and averaged the elapsed time in milliseconds:

Stream To Line : 2442.1

Read To Line : 2534.9

Skip To Line : 0

It looks like Skip does not even consider the other lines before lineNumber and knows exactly where the 9 millionth line is. Does it somehow infer this from the file? Is there some overhead in the way the other 2 methods process the lines because they are returning what is read? How is there such a big difference?

Kieran Bristow asked Feb 12 '23 02:02


1 Answer

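Basically, the problem is your test. In skipToLine you never call MoveNext() on the enumerator, so it hasn't done anything yet: no line has been read, which is why it takes 0 ms. Iterators are often deferred and streaming, especially in the case of LINQ. Note also that lines.ToString() returns the name of the enumerator's type, not a line from the file; the line the enumerator is positioned on is lines.Current.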
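
To make "deferred and streaming" concrete, here is a rough sketch of how a Skip-like method can be written as an iterator block. This is an illustration only, not the actual library source; the names SkipDemo and SkipSketch are made up for this example, and argument validation is omitted. Nothing in the method body runs until the caller first calls MoveNext(), and at that point the skipped items still have to be read one by one:

```csharp
using System.Collections.Generic;

public static class SkipDemo
{
    // Hypothetical, simplified stand-in for Enumerable.Skip (illustration only).
    // Because this is an iterator block, calling SkipSketch(...) just creates
    // an object; the body below starts running on the first MoveNext().
    public static IEnumerable<T> SkipSketch<T>(IEnumerable<T> source, int count)
    {
        using (IEnumerator<T> e = source.GetEnumerator())
        {
            // Read and discard the first 'count' items. There is no shortcut:
            // every skipped item is still pulled from the source.
            while (count > 0 && e.MoveNext())
            {
                count--;
            }

            // If the source ran out early, count is still positive and
            // nothing is yielded.
            if (count <= 0)
            {
                while (e.MoveNext())
                {
                    yield return e.Current;
                }
            }
        }
    }
}
```

For a source such as File.ReadLines, this means reaching the 9 millionth line still costs roughly the same as reading the 9 million lines before it; the cost is only paid once enumeration starts.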

Incidentally, it is very rare that you need to use GetEnumerator(); the idiomatic way to access such data is via foreach.

If you want to see this in action:

    static void Main()
    {
        using(var iter = GetData().GetEnumerator())
        {
            System.Console.WriteLine("Have iterator");
            while(iter.MoveNext())
            {
                System.Console.WriteLine(iter.Current);
            }
            System.Console.WriteLine("Done");
        }
    }
    static IEnumerable<int> GetData()
    {
        System.Console.WriteLine("Before doing anything");
        yield return 1;
        yield return 2;
        yield return 3;
        System.Console.WriteLine("After doing everything");
    }

You should notice that "Have iterator" is written before "Before doing anything", which tells us that one can have an iterator that hasn't done anything yet. It is the first MoveNext() that makes it print.
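
Applied to the benchmark in the question, a corrected version of the Skip test might look like the sketch below. The class name LineLookup and the fileName parameter are assumptions made for this example (the original methods read fileName from a field). The first method keeps the explicit enumerator but actually advances it; the second gets the same result with FirstOrDefault(), which enumerates for you:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LineLookup
{
    // Hypothetical corrected version of skipToLine. The single MoveNext()
    // call is what makes Skip read and discard the earlier lines.
    public static string SkipToLine(string fileName, int lineNumber)
    {
        using (IEnumerator<string> lines =
            File.ReadLines(fileName).Skip(lineNumber - 1).GetEnumerator())
        {
            // MoveNext() returns false if the file has fewer lines than requested.
            return lines.MoveNext() ? lines.Current : null;
        }
    }

    // Same result without touching the enumerator directly.
    public static string SkipToLineLinq(string fileName, int lineNumber)
    {
        return File.ReadLines(fileName).Skip(lineNumber - 1).FirstOrDefault();
    }
}
```

Timed this way, the Skip variant should be expected to land in the same range as the other two methods, since all three have to read every line before the requested one.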

Marc Gravell answered Feb 13 '23 21:02