Randomly Pick Lines From a File Without Slurping It With Unix

I have a file with 10^7 lines, from which I want to pick 1/100 of the lines at random. This is the AWK code I have, but it slurps the whole file into memory beforehand, and my PC's memory cannot handle that. Is there another approach?

awk 'BEGIN{srand()}
!/^$/{ a[c++]=$0}
END {
  for ( i=1;i<=c ;i++ )  {
    num=int(rand() * c)
    if ( a[num] ) {
        print a[num]
        delete a[num]
        d++
    }
    if ( d == c/100 ) break
  }
}' file
neversaint asked Mar 28 '09 05:03


1 Answer

If you have that many lines, are you sure you need exactly 1%, or would a statistical estimate be enough?

In the second case, just select each line independently with probability 1%:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' file
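To get a feel for how far the per-line approach can drift from exactly 1%, here is a rough sketch of the arithmetic. It assumes every line is non-empty and that each line is kept independently, so the sample size is binomial with mean n*p and standard deviation sqrt(n*p*(1-p)). The function name sample_size_stats is made up for this illustration.

```shell
# Hypothetical helper: expected sample size and its standard deviation
# when each of n lines is kept independently with probability p.
sample_size_stats() {
    awk -v n="$1" -v p="$2" 'BEGIN {
        printf "expected=%.0f stddev=%.0f\n", n * p, sqrt(n * p * (1 - p))
    }'
}

sample_size_stats 10000000 0.01   # prints: expected=100000 stddev=315
```

So for the file in the question you would typically get about 100000 lines, give or take a few hundred.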

If you'd like the header line plus a random sample of lines after, use:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}' file
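If you do need an exact count, one possible alternative (not what the commands above do) is reservoir sampling: keep only k lines in memory and replace them at random as the file streams by. The sketch below is a minimal version under that assumption; the function name sample_exact and the argument order are invented for illustration, and k has to be computed by you (for example 100000 for 1% of 10^7 lines).

```shell
# Hypothetical helper: print exactly K non-empty lines of FILE, chosen
# uniformly at random (or all of them if there are fewer than K).
# Only K lines are held in memory at any time.
sample_exact() {
    awk -v k="$1" '
        BEGIN { srand() }
        /^$/ { next }                      # skip empty lines, like the one-liners above
        { n++ }                            # n = non-empty lines seen so far
        n <= k { r[n] = $0; next }         # fill the reservoir with the first k lines
        {
            j = int(rand() * n) + 1        # uniform integer in 1..n
            if (j <= k) r[j] = $0          # keep this line with probability k/n
        }
        END {
            m = (n < k) ? n : k
            for (i = 1; i <= m; i++) print r[i]
        }
    ' "$2"
}

# Usage: sample_exact 100000 file
```

The output is not in the original file order, because reservoir slots are overwritten as later lines arrive. Also note that srand() with no argument seeds from the current time, so two runs within the same second can give the same sample.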
cadrian answered Sep 22 '22 23:09