 

Randomly split a file with a specific proportion

Tags: random, split, awk

I want to randomly 80/20 split a file using awk.

I have read and tried the option found here, in which something like the following is proposed:

$ awk -v N=`cat FILE | wc -l` 'rand()<3000/N' FILE

This works great if you want a random selection.

However, is it possible to alter this awk command so that it splits the one file into two files in an 80/20 (or any other) proportion?

owwoow14 asked Dec 19 '22 22:12



1 Answer

With gawk, you'd write

gawk '
    BEGIN {srand()}
    {f = FILENAME (rand() <= 0.8 ? ".80" : ".20"); print > f}
' file

Example:

$ seq 100 > 100.txt
$ gawk 'BEGIN {srand()} {f = FILENAME (rand() <= 0.8 ? ".80" : ".20"); print > f}' 100.txt
$ wc -l 100.txt*
100 100.txt
 23 100.txt.20
 77 100.txt.80
200 total
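
For "any other" proportion, the 0.8 threshold can be passed in as a variable instead of being hard-coded. The following is a minimal sketch of the same idea, not part of the original answer: the variable name p and the .a/.b suffixes are made up for illustration, and plain awk is assumed to behave like gawk here. As in the example above, the split is only approximately p on any single run.

```shell
# Build a 100-line test file in a temporary location.
file=$(mktemp)
seq 100 > "$file"

# p is a hypothetical variable name: the fraction of lines sent to the ".a" part.
awk -v p=0.7 '
    BEGIN { srand() }                       # seed from the current time
    {
        # Each line goes to ".a" with probability p, otherwise to ".b".
        f = FILENAME (rand() < p ? ".a" : ".b")
        print > f
    }
' "$file"

# The two counts always add up to 100, but vary around 70/30 from run to run.
wc -l "$file".a "$file".b
```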

To ensure exactly 20 lines in the ".20" file (here $file is a 100-line file, so that is exactly 20%):

$ paste -d $'\034' <(seq $(wc -l < "$file") | sort -R) "$file" \
| awk -F $'\034' -v file="$file" '{
    f = file ($1 <= 20 ? ".20" : ".80")
    print $2 > f
}'

$ wc -l "$file"*
100 testfile
 20 testfile.20
 80 testfile.80
200 total

\034 is the ASCII FS character, unlikely to appear in a text file.

sort -R, used here to shuffle the line numbers, may not be portable: it is not specified by POSIX, though both GNU and BSD sort provide it.
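
If sort -R is not available, an exact split can also be done in awk alone. The sketch below swaps in a different technique from the paste/sort pipeline above, namely selection sampling: each line is picked with probability (lines still needed) / (lines still unread), which yields exactly k picked lines. The names n, k and picked are illustrative, and the 20% share is computed with integer arithmetic, so it rounds down when the line count is not a multiple of 5. Unlike the shuffle-based version, lines keep their original relative order within each output file.

```shell
# Build a 100-line test file in a temporary location.
file=$(mktemp)
seq 100 > "$file"

n=$(wc -l < "$file")        # total number of lines
k=$(( n * 20 / 100 ))       # exact size of the ".20" part (rounded down)

awk -v n="$n" -v k="$k" -v out="$file" '
    BEGIN { srand() }
    {
        # n - NR + 1 lines are still unread (including this one);
        # k - picked lines are still needed for the ".20" file.
        if (rand() * (n - NR + 1) < k - picked) {
            picked++
            print > (out ".20")
        } else {
            print > (out ".80")
        }
    }
' "$file"

wc -l "$file".20 "$file".80
```

When the number of lines still needed equals the number still unread, the test is always true, and once k lines have been picked it is always false, so the counts come out as exactly 20 and 80 for this input.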

glenn jackman answered Dec 30 '22 18:12
