I have bunch of files and very file has a header of 5 lines. In the rest of the file, pair of line form an entry. I need to randomly select entry from these files. How can i select random files and random entry(pair of line, excluding header) ?
If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.
Here's an intuitive explanation for the algorithm.
Process the file line by line.
pick = line, with probability 1/N, where N = line number
In other words, on line 1, we will pick line 1 with 1/1
probability. On line 2, we will change the pick to line 2, with 1/2
probability. On line 3, we will change the pick to line 3, with 1/3
probability. Etc.
For an intuitive proof, imagine a file with 3 lines:
1 Pick line 1.
/ \
.5 .5
/ \
2 1 Switch to line 2?
/ \ / \
.67 .33 .33 .67
/ \ / \
2 3 1 Switch to line 3?
The probability for each outcome:
Line 1: .5 * .67 = 1/3
Line 2: .5 * .67 = 1/3
Line 3: .5 * .33 * 2 = 1/3
From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4
chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4
).
Here's the Perl FAQ answer, adapted to your problem.
use strict;
use warnings;
# Ignore 5 lines.
<> for 1 .. 5;
# Use reservoir sampling to select pairs from remaining lines.
my (@picks, $n);
until (eof){
my @lines;
$lines[$_] = <> for 0 .. 1;
$n ++;
@picks = @lines if rand($n) < 1;
}
print @picks;
You may find perlfaq5 useful.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With