 

Read random line from a large text file

Tags: c#, text-files

I have a file with 5000+ lines. I want to find the most efficient way to choose one of those lines each time I run my program. I had originally intended to use the random method to choose one (that was before I knew there were 5000 lines). I thought that might be inefficient, so I considered reading the first line, then deleting it from the top and appending it to the bottom. But it seems that I would have to read the whole file and create a new file in order to delete from the top.

What is the most efficient way: the random method or the new file method?

The program will be run every 5 minutes, and I'm using C# on .NET 4.5.

Asked Sep 01 '15 by Angela Baines


2 Answers

In .NET 4.*, it is possible to read a single line of a file by index without loading the whole file into memory, since File.ReadLines enumerates lazily. For example, to get line X (zero-based):

string line = File.ReadLines(FileName).Skip(X).First();

Full example:

var fileName = @"C:\text.txt";
// Note: ToList() reads the entire file into memory.
var file = File.ReadLines(fileName).ToList();
int count = file.Count;
Random rnd = new Random();
int skip = rnd.Next(0, count);
string line = file[skip];
Console.WriteLine(line);
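If you would rather not hold all 5000+ lines in memory at once, a two-pass variant is possible (a sketch, reusing the same C:\text.txt path from the example above): the first pass counts the lines lazily, the second pass skips to a random index. Each line is streamed and discarded rather than stored.

```csharp
using System;
using System.IO;
using System.Linq;

class RandomLine
{
    static void Main()
    {
        var fileName = @"C:\text.txt";

        // First pass: count the lines without materializing them.
        int count = File.ReadLines(fileName).Count();

        // Second pass: skip to a random zero-based index and take that line.
        int skip = new Random().Next(0, count);
        string line = File.ReadLines(fileName).Skip(skip).First();

        Console.WriteLine(line);
    }
}
```

The trade-off is reading the file up to twice instead of storing it; for a file of only a few thousand short lines, either approach is fast enough for a program that runs every 5 minutes.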
Answered Sep 18 '22 by randoms


Let's assume the file is so large that you cannot afford to fit it into RAM. In that case, you would want to use reservoir sampling, an algorithm designed to pick a uniformly random item from a sequence of unknown, arbitrary length that might not fit into memory:

Random r = new Random();
int currentLine = 1;
string pick = null;
foreach (string line in File.ReadLines(filename))
{
    if (r.Next(currentLine) == 0)
    {
        pick = line;
    }
    ++currentLine;
}
return pick;

This algorithm is slightly unintuitive. At a high level, it works by giving line N a 1/N chance of replacing the currently selected line. Thus, line 1 is always selected at first, but has a 50% chance of later being replaced by line 2, a 1/3 chance of being replaced by line 3, and so on.

I've found understanding this algorithm to be easiest in the form of a proof of correctness. So, a simple proof by induction:

1) Base case: By inspection, the algorithm works if there is 1 line: that line is selected with probability 1.
2) Inductive step: Assume the algorithm works for N-1 lines. Then after processing the first N-1 lines of an N-line file, each of those N-1 lines is currently selected with probability 1/(N-1).
3) The final iteration ensures that line N is selected with probability 1/N (because that is what the algorithm explicitly assigns it), reducing the probability of each previous line to:

  1/(N-1) * (1 - 1/N)
= 1/(N-1) * (N/N - 1/N)
= 1/(N-1) * (N-1)/N
= (N-1) / (N*(N-1))
= 1/N
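To make the 1/N claim concrete, here is a small self-contained simulation (my own sketch, not part of the original answer): it runs the reservoir-sampling loop above 100,000 times over a five-line list and prints how often each line is selected, which should hover near 20% for all five.

```csharp
using System;
using System.Collections.Generic;

class ReservoirDemo
{
    // Reservoir sampling: pick one item uniformly from a sequence
    // of unknown length, holding only a single item at a time.
    static string Pick(IEnumerable<string> lines, Random r)
    {
        string pick = null;
        int currentLine = 1;
        foreach (string line in lines)
        {
            if (r.Next(currentLine) == 0)
            {
                pick = line;
            }
            ++currentLine;
        }
        return pick;
    }

    static void Main()
    {
        var lines = new[] { "a", "b", "c", "d", "e" };
        var counts = new Dictionary<string, int>();
        var r = new Random(42);
        const int trials = 100000;

        for (int i = 0; i < trials; i++)
        {
            string p = Pick(lines, r);
            counts[p] = counts.TryGetValue(p, out int c) ? c + 1 : 1;
        }

        // Each of the 5 lines should be chosen close to 1/5 of the time.
        foreach (var kv in counts)
        {
            Console.WriteLine($"{kv.Key}: {(double)kv.Value / trials:P1}");
        }
    }
}
```

With 100,000 trials, the observed frequencies typically land within a fraction of a percentage point of 20%, matching the proof.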

If you know in advance how many lines are in the file, this algorithm is more expensive than necessary, since it always reads the entire file; skipping directly to a random index, as in the other answer, stops as soon as the chosen line is reached.

Answered Sep 22 '22 by Brian