Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Randomly selecting lines from files

Tags:

python

shell

perl

I have bunch of files and very file has a header of 5 lines. In the rest of the file, pair of line form an entry. I need to randomly select entry from these files. How can i select random files and random entry(pair of line, excluding header) ?

like image 330
AlgoMan Avatar asked Jun 09 '10 20:06

AlgoMan


2 Answers

If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.

Here's an intuitive explanation for the algorithm.

Process the file line by line.
pick = line, with probability 1/N, where N = line number

In other words, on line 1, we will pick line 1 with 1/1 probability. On line 2, we will change the pick to line 2, with 1/2 probability. On line 3, we will change the pick to line 3, with 1/3 probability. Etc.

For an intuitive proof, imagine a file with 3 lines:

        1            Pick line 1.
       / \
     .5  .5
     /     \
    2       1        Switch to line 2?
   / \     / \
 .67 .33 .33 .67
 /     \ /     \
2       3       1    Switch to line 3?

The probability for each outcome:

Line 1: .5 * .67     = 1/3
Line 2: .5 * .67     = 1/3
Line 3: .5 * .33 * 2 = 1/3

From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4 chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).

Here's the Perl FAQ answer, adapted to your problem.

use strict;
use warnings;

# Ignore 5 lines.
<> for 1 .. 5;

# Use reservoir sampling to select pairs from remaining lines.
my (@picks, $n);
until (eof){
    my @lines;
    $lines[$_] = <> for 0 .. 1;

    $n ++;
    @picks = @lines if rand($n) < 1;
}

print @picks;
like image 87
FMc Avatar answered Oct 04 '22 03:10

FMc


You may find perlfaq5 useful.

like image 21
Eugene Yarmash Avatar answered Oct 04 '22 02:10

Eugene Yarmash