I want to randomly 80/20 split a file using awk.
I have read and tried the option found here, which proposes something like the following:
$ awk -v N="$(wc -l < FILE)" 'rand()<3000/N' FILE
This works great if you want a random selection.
However, is it possible to alter this awk in order to split the one file into two files of 80/20 (or any other) proportion?
With gawk, you'd write:
gawk '
BEGIN {srand()}
{f = FILENAME (rand() <= 0.8 ? ".80" : ".20"); print > f}
' file
Example:
seq 100 > 100.txt
gawk 'BEGIN {srand()} {f = FILENAME (rand() <= 0.8 ? ".80" : ".20"); print > f}' 100.txt
wc -l 100.txt*
100 100.txt
23 100.txt.20
77 100.txt.80
200 total
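The 23/77 result above comes from assigning each line independently, so the counts only approximate 80/20 and change from run to run. If a repeatable split is wanted, one option is to pass a fixed seed to srand. The following is a minimal sketch; the variable names p and seed and the file name sample.txt are illustrative and not part of the original answer.

```shell
# Sketch: probabilistic split with a configurable ratio and a fixed seed.
# p, seed and sample.txt are assumed names chosen for this example.
seq 100 > sample.txt
awk -v p=0.8 -v seed=42 '
BEGIN { srand(seed) }            # same seed, same split on every run
{
    # each line goes to the ".80" file with probability p
    out = FILENAME (rand() < p ? ".80" : ".20")
    print > out
}' sample.txt
wc -l sample.txt.80 sample.txt.20
```

The two counts always add up to the input size, but the individual counts still depend on the seed and on the awk implementation.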
To ensure 20 lines in the "20" file:
$ paste -d $'\034' <(seq $(wc -l < "$file") | sort -R) "$file" \
| awk -F $'\034' -v file="$file" '{
f = file ($1 <= 20 ? ".20" : ".80")
print $2 > f
}'
$ wc -l "$file"*
100 testfile
20 testfile.20
80 testfile.80
200 total
\034 is the ASCII FS (file separator) character, which is unlikely to appear in a text file.
Using sort -R to shuffle the input may not be portable, though it is available in both GNU and BSD sort.
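If sort -R is not available, an exact split can also be sketched in awk alone using selection sampling: each line is picked with probability (lines still needed) / (lines still remaining), which yields exactly k picks. This swaps in a different technique from the paste/sort pipeline above, and the names n, k, seed and sample.txt are assumptions made for the example.

```shell
# Sketch: exact 20/80 split without sort -R (selection sampling).
# n, k, seed and sample.txt are assumed names chosen for this example.
file=sample.txt
seq 100 > "$file"
n=$(wc -l < "$file")
awk -v n="$n" -v k=20 -v seed=42 '
BEGIN { srand(seed) }
{
    # n - NR + 1 lines are left, including this one; k picks are still needed
    if (rand() * (n - NR + 1) < k) { k--; out = FILENAME ".20" }
    else out = FILENAME ".80"
    print > out
}' "$file"
wc -l "$file".20 "$file".80
```

Unlike the shuffle-based version, the lines in each output file keep their original relative order.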