I have a lot of text files that look like this:
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
>HLGKAHOLAGGATACCATAGATGGCACGCCCT
>DLGKAHOLAGGATACCATAGATGGCACGCCCT
>ELGKAHOLAGGATACCATAGATGGCACGCCCT
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>JGGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT
Is there a way to do a sampling without replacement using awk?
For example, I have this 8 lines, and I only want to sample 4 of these randomly in a new file, without replacement. The output should look something like this:
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
Thanks in advance
How about this for a random sampling of 10% of your lines?
awk 'rand()>0.9' yourfile1 yourfile2 anotherfile
I am not sure what you mean by "replacement"... there is no replacement occurring here, just random selection.
Basically, it looks at each line of each file precisely once and generates a random number on the interval 0 to 1. If the random number is greater than 0.9, the line is output. So basically it is rolling a 10 sided dice for each line and only printing it if the dice comes up as 10. No chance of a line being printed twice - unless it occurs twice in your files, of course.
For added randomness (!) you can add an srand()
at the start as suggested by @klashxx
awk 'BEGIN{srand()} rand()>0.9' yourfile(s)
Yes, but I wouldn't. I would use shuf
or sort -R
(neither POSIX) to randomize the file and then select the first n
lines using head
.
If you really want to use awk
for this, you would need to use the rand
function, as Mark Setchell points out.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With