Get random lines from large files in bash

How can I get n random lines from very large files that can't fit in memory?

It would also be great if I could add filters before or after the randomization.

Update 1

In my case the specs are:

more than 100 million lines
files larger than 10 GB
usual random batch size 10000-30000
512 MB RAM, hosted Ubuntu Server 14.10

So losing a few lines from the file won't be a big problem, as each line has only a 1 in 10000 chance of being picked anyway, but performance and resource consumption would be a problem.

Under such constraints, the following approach works better:

seek to a random position in the file (i.e. you will land somewhere "inside" a line)
go backward from this position and find the start of that line
go forward and print the full line

For this you need a tool that can seek in files, for example perl.

use strict;
use warnings;
use Symbol;
use Fcntl qw( :seek O_RDONLY ) ;
my $seekdiff = 256; #half window: read from "rand_position-256" up to "rand_position+256"

my($want, $filename) = @ARGV;
die("Usage: perl randlines.pl wanted_count_of_lines file_name\n") unless defined $filename;

my $fd = gensym ;
sysopen($fd, $filename, O_RDONLY ) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek( $fd, 0, SEEK_END ) or die("Can't seek: $!");

my $buffer;
my $cnt = 0;
while($want > $cnt++) {
    my $randpos = int(rand($endpos));   #random file position
    my $seekpos = $randpos - $seekdiff; #start read here ($seekdiff chars before)
    $seekpos = 0 if( $seekpos < 0 );

    sysseek($fd, $seekpos, SEEK_SET);   #seek to position
    my $in_count = sysread($fd, $buffer, $seekdiff<<1); #read 2*seekdiff bytes

    my $rand_in_buff = ($randpos - $seekpos)-1; #the random position in the buffer

    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; #find the beginning of the line in the buffer
    my $lineend = index $buffer, "\n", $linestart;            #find the end of the line in the buffer
    #if the end of the line is not in the buffer, an empty line is printed
    my $the_line = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend-$linestart;

    print "$the_line\n";
}

Save the above to a file such as "randlines.pl" and use it as:

perl randlines.pl wanted_count_of_lines file_name

e.g.

perl randlines.pl 10000 ./BIGFILE
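To filter after the randomization, the output of the script can simply be piped through other tools. Here is a minimal sketch, assuming a POSIX shell with grep, awk and head; the helper name clean_sample is hypothetical and not part of the script above. It drops the empty lines the script prints when a line does not fit into its read window, removes repeats (positions are drawn independently, so the same line can be picked more than once) and keeps at most N lines.

```shell
# clean_sample N: read sampled lines on stdin, drop empty and repeated
# lines, and print at most N of the remaining ones.
# (clean_sample is a hypothetical helper name used for illustration.)
clean_sample() {
    grep -v '^$' | awk '!seen[$0]++' | head -n "$1"
}

# Possible use: oversample a bit, then trim to the wanted batch size.
#   perl randlines.pl 12000 ./BIGFILE | clean_sample 10000
```

Filtering before the randomization is different, because the script needs a file it can seek in and cannot read from a pipe. One option is to write the matching lines to a temporary file first (for example with grep) and run the script on that file.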

The script uses very low-level I/O operations (sysseek and sysread), so it is very fast: on my notebook, selecting 30k lines from 10M took half a second.
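One thing worth checking is the window size. The script reads only 2*$seekdiff bytes around the random position, so a line that does not fit is printed cut off at its start or as an empty line. As far as I can tell from the code, a $seekdiff larger than the longest line (in bytes) avoids this. Below is a rough way to measure that length, assuming a POSIX awk; longest_line_bytes is a hypothetical helper name. It makes one streaming pass over the file, so it needs little memory but has to read the whole file once.

```shell
# longest_line_bytes FILE: print the length in bytes of the longest line
# in FILE (newline not counted). LC_ALL=C makes awk count bytes instead
# of multibyte characters.
# (longest_line_bytes is a hypothetical helper name used for illustration.)
longest_line_bytes() {
    LC_ALL=C awk '{ n = length($0); if (n > max) max = n } END { print max + 0 }' "$1"
}

# Possible use:
#   longest_line_bytes ./BIGFILE
```

If the reported value is, say, 400, then setting $seekdiff to 512 in the script should be enough for every line to fit into the buffer.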

Tags: bash, command-line, random-sample, line-processing

Stefan Rogin

jm666
