How can I get n
random lines from very large files that can't fit in memory.
Also it would be great if I could add filters before or after the randomization.
in my case the specs are :
so losing a few lines from the file won't be such a big problem as they have a 1 in 10000 chance anyway, but performance and resource consumption would be a problem
To look at the first few lines of a file, type head filename, where filename is the name of the file you want to look at, and then press <Enter>. By default, head shows you the first 10 lines of a file. You can change this by typing head -number filename, where number is the number of lines you want to see.
You can use the shuf command followed by the file you want to shuffle. In this case, the file's contents will get shuffled, and the output will be displayed on the standard output. You can use the syntax below.
In such limiting factors, the following approach will be better.
For this you need a tool that can seek in files, for example perl
.
use strict;
use warnings;
use Symbol;
use Fcntl qw( :seek O_RDONLY ) ;
my $seekdiff = 256; #e.g. from "rand_position-256" up to rand_positon+256
my($want, $filename) = @ARGV;
my $fd = gensym ;
sysopen($fd, $filename, O_RDONLY ) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek( $fd, 0, SEEK_END ) or die("Can't seek: $!");
my $buffer;
my $cnt;
while($want > $cnt++) {
my $randpos = int(rand($endpos)); #random file position
my $seekpos = $randpos - $seekdiff; #start read here ($seekdiff chars before)
$seekpos = 0 if( $seekpos < 0 );
sysseek($fd, $seekpos, SEEK_SET); #seek to position
my $in_count = sysread($fd, $buffer, $seekdiff<<1); #read 2*seekdiff characters
my $rand_in_buff = ($randpos - $seekpos)-1; #the random positon in the buffer
my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; #find the begining of the line in the buffer
my $lineend = index $buffer, "\n", $linestart; #find the end of line in the buffer
my $the_line = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend-$linestart;
print "$the_line\n";
}
Save the above into some file such "randlines.pl" and use it as:
perl randlines.pl wanted_count_of_lines file_name
e.g.
perl randlines.pl 10000 ./BIGFILE
The script does very low-level IO operations, i.e. it is VERY FAST. (on my notebook, selecting 30k lines from 10M took half second).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With