Following up on this question, I need to get exactly n
lines at random out of a file (or stdin
). This would be similar to head
or tail
, except I want some from the middle.
Now, other than looping over the file with the solutions to the linked question, what's the best way to get exactly n
lines in one run?
For reference, I tried this:
#!/usr/bin/perl -w
use strict;
my $ratio = shift;
print $ratio, "\n";
while () {
print if ((int rand $ratio) == 1);
}
where $ratio
is the rough percentage of lines I want. For instance, if I want 1 in 10 lines:
random_select 10 a.list
However, this doesn't give me an exact amount:
aaa> foreach i ( 0 1 2 3 4 5 6 7 8 9 )
foreach? random_select 10 a.list | wc -l
foreach? end
4739
4865
4739
4889
4934
4809
4712
4842
4814
4817
The other thought I had was slurping the input file and then choosing n
at random from the array, but that's a problem if I have a really big file.
Any ideas?
Edit: This is an exact duplicate of this question.
Here's a nice one-pass algorithm that I just came up with, having O(N) time complexity and O(M) space complexity, for reading M lines from an N-line file.
Assume M <= N.
S
be the set of chosen lines. Initialize S
to the first M
lines of the file. If the ordering of the final result is important, shuffle S
now.l
. So far, we have read n = M + 1
total lines. The probability that we want to choose l
as one of our final lines is therefore M/n
.l
with probability M/n
; use a RNG to decide whether to accept or reject l
.l
has been accepted, randomly choose one of the lines in S
and replace it with l
.n
with each new line read.S
of chosen lines.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With