Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I get exactly n random lines from a file with Perl?

Following up on this question, I need to get exactly n lines at random out of a file (or stdin). This would be similar to head or tail, except I want some from the middle.

Now, other than looping over the file with the solutions to the linked question, what's the best way to get exactly n lines in one run?

For reference, I tried this:

#!/usr/bin/perl -w
use strict;
my $ratio = shift;
print $ratio, "\n";
while () {
    print if ((int rand $ratio) == 1); 
}

where $ratio is the rough percentage of lines I want. For instance, if I want 1 in 10 lines:

random_select 10 a.list

However, this doesn't give me an exact amount:

aaa> foreach i ( 0 1 2 3 4 5 6 7 8 9 )
foreach? random_select 10 a.list | wc -l
foreach? end
4739
4865
4739
4889
4934
4809
4712
4842
4814
4817

The other thought I had was slurping the input file and then choosing n at random from the array, but that's a problem if I have a really big file.

Any ideas?

Edit: This is an exact duplicate of this question.

like image 536
Nathan Fellman Avatar asked May 13 '09 07:05

Nathan Fellman


1 Answers

Here's a nice one-pass algorithm that I just came up with, having O(N) time complexity and O(M) space complexity, for reading M lines from an N-line file.

Assume M <= N.

  1. Let S be the set of chosen lines. Initialize S to the first M lines of the file. If the ordering of the final result is important, shuffle S now.
  2. Read in the next line l. So far, we have read n = M + 1 total lines. The probability that we want to choose l as one of our final lines is therefore M/n.
  3. Accept l with probability M/n; use a RNG to decide whether to accept or reject l.
  4. If l has been accepted, randomly choose one of the lines in S and replace it with l.
  5. Repeat steps 2-4 until the file has been exhausted of lines, incrementing n with each new line read.
  6. Return the set S of chosen lines.
like image 130
kquinn Avatar answered Oct 11 '22 19:10

kquinn