 

Efficient way to read a specific line number of a file. (BONUS: Python Manual Misprint)

Tags: python, c#, .net, file

I have a 100 GB text file, which is a BCP dump from a database. When I try to import it with BULK INSERT, I get a cryptic error on line number 219506324. Before solving this issue I would like to see this line, but alas my favorite method of

import linecache
print linecache.getline(filename, linenumber)

is throwing a MemoryError. Interestingly, the manual says that "This function will never throw an exception." On this large file it throws one when I try to read even line number 1, and I have about 6 GB of free RAM...

I would like to know what is the most elegant method to get to that unreachable line. Available tools are Python 2, Python 3 and C# 4 (Visual Studio 2010). Yes, I understand that I can always do something like

var line = 0;
using (var stream = new StreamReader(File.OpenRead(@"s:\source\transactions.dat")))
{
     while (++line < 219506324) stream.ReadLine(); //waste some cycles
     Console.WriteLine(stream.ReadLine());
}

Which would work, but I doubt it's the most elegant way.
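For reference, the same skip-ahead idea in Python can be written with itertools.islice, which never holds more than one line in memory (a sketch; `get_line` is a hypothetical helper name, not a stdlib function):

```python
from itertools import islice

def get_line(filename, linenumber):
    """Return line `linenumber` (1-based) without loading the whole file."""
    with open(filename) as f:
        # islice lazily skips the first linenumber-1 lines,
        # then yields the target line (or nothing if the file is shorter).
        return next(islice(f, linenumber - 1, linenumber), None)
```

This still reads the file sequentially up to the target, like the C# loop above, but lets the iterator protocol do the cycle-wasting.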

EDIT: I'm waiting to close this thread, because the hard drive containing the file is being used right now by another process. I'm going to test both suggested methods and report timings. Thank you all for your suggestions and comments.

The results are in. I implemented Gabe's and Alex's methods to see which one was faster (if I'm doing anything wrong, do tell). I'm fetching the 10 millionth line of my 100 GB file, first with the method Gabe suggested and then with the method Alex suggested, which I loosely translated into C#. The only thing I added myself is first reading a 300 MB file into memory, just to clear the HDD cache.

const string file = @"x:\....dat"; // 100 GB file
const string otherFile = @"x:\....dat"; // 300 MB file
const int linenumber = 10000000;

ClearHDDCache(otherFile);
GabeMethod(file, linenumber);  //Gabe's method

ClearHDDCache(otherFile);
AlexMethod(file, linenumber);  //Alex's method

// Results
// Gabe's method: 8290 (ms)
// Alex's method: 13455 (ms)

The implementation of Gabe's method is as follows:

var gabe = new Stopwatch();
gabe.Start();
var data = File.ReadLines(file).ElementAt(linenumber - 1);
gabe.Stop();
Console.WriteLine("Gabe's method: {0} (ms)",  gabe.ElapsedMilliseconds);

While Alex's method is slightly trickier:

var alex = new Stopwatch();
alex.Start();
const int buffersize = 100 * 1024; //bytes
var buffer = new byte[buffersize];
var counter = 0;
var bytesRead = 0;
using (var filestream = File.OpenRead(file))
{
    while (true) // Cutting corners here...
    {
        bytesRead = filestream.Read(buffer, 0, buffersize); //may return fewer bytes than requested
        if (bytesRead == 0) break; //end of file reached before the target line
        //At this point we could probably launch an async read into the next chunk...
        var linesread = buffer.Take(bytesRead).Count(b => b == 10); //10 is the ASCII line feed '\n'
        if (counter + linesread >= linenumber) break;
        counter += linesread;
    }
}
//The downside of this method is that we have to assume the line fits into the buffer, or do something clever...er
var data = new ASCIIEncoding().GetString(buffer, 0, bytesRead).Split('\n').ElementAt(linenumber - counter - 1);
alex.Stop();
Console.WriteLine("Alex's method: {0} (ms)", alex.ElapsedMilliseconds);
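The same chunk-and-count idea translates directly to Python. This sketch makes the same cut corner as the C# version, namely that the target line sits entirely inside the chunk where its newline count is reached (`get_line_chunked` is a hypothetical helper name):

```python
def get_line_chunked(filename, linenumber, buffersize=100 * 1024):
    """Find line `linenumber` (1-based) by counting newlines in raw chunks."""
    counter = 0  # complete lines seen in previous chunks
    with open(filename, 'rb') as f:
        while True:
            buffer = f.read(buffersize)
            if not buffer:
                return None  # file has fewer lines than requested
            linesread = buffer.count(b'\n')  # b'\n' is the ASCII line feed
            if counter + linesread >= linenumber:
                break
            counter += linesread
    # Assumes the whole line fits inside this chunk, like the C# version.
    return buffer.split(b'\n')[linenumber - counter - 1].decode('ascii')
```

Counting raw bytes avoids decoding every line, which is where this approach hopes to win over a line-by-line reader.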

So unless Alex cares to comment I'll mark Gabe's solution as accepted.

Gleno asked Aug 28 '10

1 Answer

Here's my elegant version in C#:

Console.Write(File.ReadLines(@"s:\source\transactions.dat").ElementAt(219506323));

or more general:

Console.Write(File.ReadLines(filename).ElementAt(linenumber - 1));

Of course, you may want to show some context before and after the given line:

Console.Write(string.Join("\n",
              File.ReadLines(filename).Skip(linenumber - 5).Take(10)));
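The same Skip/Take window can be expressed in Python with itertools.islice (a sketch; `context_lines` and its parameters are hypothetical names):

```python
from itertools import islice

def context_lines(filename, linenumber, before=4, after=5):
    """Yield lines linenumber-before .. linenumber+after (1-based), lazily."""
    start = max(linenumber - 1 - before, 0)  # 0-based index of first line to show
    with open(filename) as f:
        for line in islice(f, start, linenumber + after):
            yield line.rstrip('\n')
```

With the defaults this prints the same 10-line window as the C# snippet above.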

or more fluently:

File
.ReadLines(filename)
.Skip(linenumber - 5)
.Take(10)
.AsObservable()
.Do(Console.WriteLine);

BTW, the linecache module does not do anything clever with large files. It just reads the whole thing in, keeping it all in memory. The only exceptions it catches are I/O-related (can't access file, file not found, etc.). Here's the important part of the code:

    fp = open(fullname, 'rU')
    lines = fp.readlines()
    fp.close()

In other words, it's trying to fit the whole 100 GB file into 6 GB of RAM! What the manual should perhaps say is "This function will never throw an exception if it can't access the file."
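That guarded I/O case is easy to demonstrate: for a file it can't open, linecache swallows the error and hands back an empty string instead of raising.

```python
import linecache

# No such file: linecache catches the I/O error internally
# and returns '' rather than throwing.
print(repr(linecache.getline('/no/such/file.txt', 1)))  # → ''
```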

Gabe answered Sep 28 '22