Sampling without replacement using awk

Tags: bash, shell, awk

I have a lot of text files that look like this:

>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
>HLGKAHOLAGGATACCATAGATGGCACGCCCT
>DLGKAHOLAGGATACCATAGATGGCACGCCCT
>ELGKAHOLAGGATACCATAGATGGCACGCCCT
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>JGGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT

Is there a way to sample without replacement using awk?

For example, I have these 8 lines and want to randomly sample 4 of them into a new file, without replacement. The output should look something like this:

>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT

Thanks in advance

JM88 asked Mar 10 '14 15:03

2 Answers

How about this for a random sample of roughly 10% of your lines?

awk 'rand()>0.9' yourfile1 yourfile2 anotherfile

I am not sure what you mean by "replacement"... there is no replacement occurring here, just random selection.

It looks at each line of each file exactly once and generates a random number in the interval from 0 to 1. If the random number is greater than 0.9, the line is output. In effect it rolls a 10-sided die for each line and only prints the line if the die comes up 10, so the number of lines printed varies from run to run and is only about 10% on average. There is no chance of a line being printed twice, unless it occurs twice in your files, of course.

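As suggested by @klashxx, you can add an srand() at the start. Without it, some awk implementations (gawk, for example) start from the same seed on every run and so select the same lines each time; srand() with no argument seeds the generator from the current time.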

awk 'BEGIN{srand()} rand()>0.9' yourfile1 yourfile2 anotherfile
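
The one-liners above keep each line independently, so they will not return exactly 4 of 8 lines as the question asks. If a fixed count is needed, one possible approach (not part of the original answer) is to read the file twice: count the lines on the first pass, then on the second pass keep each line with probability (lines still needed) / (lines still left). The function name sample_exact and its arguments are made up for this sketch:

```shell
# sample_exact N FILE
# Print exactly N lines of FILE (or every line if FILE has fewer than N),
# chosen without replacement and printed in their original order.
# Hypothetical helper for illustration; FILE must be a regular file
# because it is read twice.
sample_exact() {
    awk -v n="$1" '
        BEGIN { srand() }               # seed from the clock so runs differ
        NR == FNR { total++; next }     # first pass: only count the lines
        # second pass: (total - FNR + 1) lines are left, (n - picked) still needed
        rand() * (total - FNR + 1) < n - picked { print; picked++ }
    ' "$2" "$2"
}
```

Something like `sample_exact 4 yourfile > sample.txt` should then write 4 distinct lines. Since srand() seeds from the time in seconds, two runs within the same second will most likely pick the same lines.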
Mark Setchell answered Oct 13 '22 01:10

Yes, but I wouldn't. I would use shuf or sort -R (neither is in POSIX) to randomize the file and then select the first n lines using head.
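
As a rough sketch of what that could look like, assuming GNU coreutils is installed (the function names sample_shuf and sample_sort are just placeholders for this example):

```shell
# sample_shuf N FILE
# Shuffle FILE and print N of its lines; shuf -n stops after N lines,
# so a separate head is not needed.
sample_shuf() {
    shuf -n "$1" -- "$2"
}

# sample_sort N FILE
# Same idea with GNU sort -R followed by head.
sample_sort() {
    sort -R -- "$2" | head -n "$1"
}
```

For the question's case this would be something like `sample_shuf 4 yourfile > sample.txt`. One difference to be aware of: sort -R orders lines by a random hash of their content, so identical lines end up next to each other, whereas shuf treats every input line separately.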

If you really want to use awk for this, you would need to use the rand function, as Mark Setchell points out.
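
If awk is a requirement, one way to get exactly n lines in a single pass is reservoir sampling: keep the first n lines, then let each later line replace a randomly chosen kept line with probability n/NR. This is a sketch of that general technique, not something taken from either answer, and the name sample_reservoir is invented here:

```shell
# sample_reservoir N [FILE...]
# Print N lines chosen without replacement from the given files (or stdin),
# reading the input only once. Prints every line if there are fewer than N.
# Hypothetical helper for illustration.
sample_reservoir() {
    n=$1
    shift
    awk -v n="$n" '
        BEGIN { srand() }                    # seed from the clock so runs differ
        NR <= n { keep[NR] = $0; next }      # fill the reservoir with the first n lines
        {
            i = int(rand() * NR) + 1         # random integer in 1..NR
            if (i <= n) keep[i] = $0         # replace a kept line with probability n/NR
        }
        END {
            m = (NR < n ? NR : n)
            for (i = 1; i <= m; i++) print keep[i]
        }
    ' "$@"
}
```

Because it never needs the total line count up front, this also works on a pipe, e.g. `cat yourfile1 yourfile2 | sample_reservoir 4`. The selected lines are not printed in their original order.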

kojiro answered Oct 13 '22 01:10