
Get random lines from large files in bash

How can I get n random lines from very large files that can't fit in memory?

Also, it would be great if I could add filters before or after the randomization.


Update 1

In my case the specs are:

  • more than 100 million lines
  • files larger than 10 GB
  • usual random batch size: 10,000-30,000 lines
  • hosted Ubuntu Server 14.10 with 512 MB RAM

So losing a few lines from the file won't be a big problem, as each line only has about a 1 in 10,000 chance of being picked anyway, but performance and resource consumption would be.

Stefan Rogin asked Mar 17 '15 15:03


1 Answer

Under such constraints, the following approach works better:

  • seek to a random position in the file (you will usually land somewhere inside a line)
  • go backward from this position and find the start of that line
  • go forward and print the full line

For this you need a tool that can seek in files, for example Perl.

use strict;
use warnings;
use Fcntl qw( :seek O_RDONLY );

# Bytes read on each side of the random position,
# i.e. from "rand_position-256" up to "rand_position+256".
my $seekdiff = 256;

my ($want, $filename) = @ARGV;
die "Usage: $0 wanted_count_of_lines file_name\n" unless defined $filename;

sysopen(my $fd, $filename, O_RDONLY) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek($fd, 0, SEEK_END) or die("Can't seek: $!");

my $buffer;
my $cnt = 0;
while ($want > $cnt++) {
    my $randpos = int(rand($endpos));   # random file position
    my $seekpos = $randpos - $seekdiff; # start reading here ($seekdiff bytes before)
    $seekpos = 0 if ($seekpos < 0);

    sysseek($fd, $seekpos, SEEK_SET) or die("Can't seek: $!"); # seek to position
    my $in_count = sysread($fd, $buffer, $seekdiff << 1);       # read 2*$seekdiff bytes
    die("Can't read: $!") unless defined $in_count;

    my $rand_in_buff = ($randpos - $seekpos) - 1; # the random position in the buffer

    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; # find the beginning of the line in the buffer
    my $lineend = index $buffer, "\n", $linestart;            # find the end of the line in the buffer
    # Limits of the fixed buffer: a line whose end lies outside it is printed
    # as an empty line, and one that starts before it loses its beginning.
    my $the_line = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend - $linestart;

    print "$the_line\n";
}

Save the above into a file such as "randlines.pl" and use it as:

perl randlines.pl wanted_count_of_lines file_name

e.g.

perl randlines.pl 10000 ./BIGFILE
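
Because every position is drawn independently, the output can contain the same line more than once, and a line whose end falls outside the read buffer comes out empty. A post-filter along the following lines could clean that up. It is only a sketch: clean_sample is a hypothetical helper name, and the oversampling figure in the usage comment is an assumption for illustration, not something measured.

```shell
#!/usr/bin/env bash
# Hypothetical post-filter (not part of randlines.pl): reads lines on stdin,
# drops empty lines and repeats, and stops after "limit" lines were printed.
clean_sample() {
    local limit=$1
    awk -v limit="$limit" '
        BEGIN { if (limit < 1) exit }          # nothing requested, print nothing
        length($0) > 0 && !seen[$0]++ {        # non-empty and not seen before
            print
            if (++n >= limit) exit             # enough lines collected
        }'
}

# Assumed usage: oversample, then trim to the batch size that is needed.
#   perl randlines.pl 30000 ./BIGFILE | clean_sample 10000
```

The same spot in the pipeline is where a filter after the randomization (grep, awk, and so on) would go. A filter before the randomization has to write its result to a real file first, since the script seeks and therefore cannot read from a pipe.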

The script uses only low-level I/O operations (sysseek and sysread), so it is very fast: on my notebook, selecting 30k lines from 10M took half a second.
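
If Perl is not an option, roughly the same seek-to-a-random-offset idea can be sketched in plain bash. This is a variant, not the script above: instead of searching backward for the start of the line, it discards the partial line at the offset and prints the next complete one. The function names random_line and random_lines are made up for this sketch, and it assumes that tail -c +N seeks on a regular file instead of reading it from the start (GNU tail is expected to do so).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one line picked through a random byte offset.
random_line() {
    local file=$1 size pos
    size=$(wc -c < "$file")
    (( size > 0 )) || return 1
    # $RANDOM only yields 15 bits, so three draws are combined into a
    # 45-bit number to reach offsets in multi-gigabyte files.
    pos=$(( ((RANDOM << 30) | (RANDOM << 15) | RANDOM) % size ))
    # Start output at byte offset pos, skip the first (usually partial)
    # line and print the next one. Nothing is printed when pos falls
    # inside the last line, and the first line of the file is never chosen.
    tail -c "+$((pos + 1))" "$file" | awk 'NR == 2 { print; exit }'
}

# Hypothetical wrapper: make "want" attempts, so up to "want" lines come out.
random_lines() {
    local want=$1 file=$2 i
    for (( i = 0; i < want; i++ )); do
        random_line "$file"
    done
}

# Assumed usage:
#   random_lines 10000 ./BIGFILE
```

Each attempt starts a tail and an awk process, so expect this to be noticeably slower than the Perl script for batches of 10,000 lines or more. Like the Perl version, it favors long lines (a random byte is more likely to fall inside one) and may return the same line twice.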

jm666 answered Sep 28 '22 06:09
