Randomly distribute files into train/test given a ratio

Question

I am trying to write a setup script that sets up a workspace for me, so that I don't need to do it manually. I started doing this in bash, but quickly realized that it would not work that well.

My next idea was to do it in Python, but I can't seem to find a proper way to do it. My idea was to make a list (a .txt file containing the paths of all the data files), shuffle this list, and then move each file to either my train dir or my test dir, according to the ratio.

But this is Python: isn't there a simpler way to do it? It seems like I am doing an unnecessary workaround just to split the files.

Bash Code:

# Partition data randomly into train and test. 
cd ${PATH_TO_DATASET}
SPLIT=0.5 #train/test split
NUMBER_OF_FILES=$(ls ${PATH_TO_DATASET} |  wc -l) ## number of directories in the dataset
even=1
echo ${NUMBER_OF_FILES}

if [ `echo "${NUMBER_OF_FILES} % 2" | bc` -eq 0 ]
then    
        even=1
        echo "Even is true"
else
        even=0
        echo "Even is false"
fi

echo -e "${BLUE}Separating files into train and test set!${NC}"

for ((i=1; i<=${NUMBER_OF_FILES}; i++))
do
    ran=$(python -c "import random;print(random.uniform(0.0, 1.0))")    
    if [[ ${ran} < ${SPLIT} ]]
    then 
        ##echo "test ${ran}"
        cp -R  $(ls -d */|sed "${i}q;d") ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data/test/
    else
        ##echo "train ${ran}"       
        cp -R  $(ls -d */|sed "${i}q;d") ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data/train/
    fi

    ##echo $(ls -d */|sed "${i}q;d")
done    

cd ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data
NUMBER_TRAIN_FILES=$(ls train/ |  wc -l)
NUMBER_TEST_FILES=$(ls test/ |  wc -l)

echo "${NUMBER_TRAIN_FILES} and ${NUMBER_TEST_FILES}..."
echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})

if [[ ${even} = 1  ]] && [[ ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES} != ${SPLIT} ]]
    then 
    echo "Something need to be fixed!"
    if [[  $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES}) > ${SPLIT} ]]
    then
        echo "Too many files in the TRAIN set move some to TEST"
        cd train
        echo $(pwd)
        while [[ ${NUMBER_TRAIN_FILES}/${NUMBER_TEST_FILES} != ${SPLIT} ]]
        do
            mv $(ls -d */|sed "1q;d") ../test/
            echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
        done
    else
        echo "Too many files in the TEST set move some to TRAIN"
        cd test
        while [[ ${NUMBER_TRAIN_FILES}/${NUMBER_TEST_FILES} != ${SPLIT} ]]
        do
            mv $(ls -d */|sed "1q;d") ../train/
            echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
        done
    fi

fi

My problem was the last part. Since I am picking the numbers at random, I could not be sure that the data would be partitioned as hoped. My last if statement was meant to check whether the partition was done right and, if not, fix it. This was not possible since I am comparing floating-point numbers, and the solution in general became more of a quick fix.

alvas · Accepted Answer

scikit-learn comes to the rescue =)

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> y
[0, 1, 2, 3, 4]


# If I want 1/4 of the data for testing
# and I set a random seed of 42.
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_train
[2, 0, 3]
>>> y_test
[1, 4]

See http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html

To demonstrate:

alvas@ubi:~$ mkdir splitfileproblem
alvas@ubi:~$ cd splitfileproblem/
alvas@ubi:~/splitfileproblem$ mkdir original
alvas@ubi:~/splitfileproblem$ mkdir train
alvas@ubi:~/splitfileproblem$ mkdir test
alvas@ubi:~/splitfileproblem$ ls
original  train  test
alvas@ubi:~/splitfileproblem$ cd original/
alvas@ubi:~/splitfileproblem/original$ ls
alvas@ubi:~/splitfileproblem/original$ echo 'abc' > a.txt
alvas@ubi:~/splitfileproblem/original$ echo 'def
ghi' > b.txt
alvas@ubi:~/splitfileproblem/original$ cat a.txt 
abc
alvas@ubi:~/splitfileproblem/original$ echo -e 'def
ghi' > b.txt
alvas@ubi:~/splitfileproblem/original$ cat b.txt 
def
ghi
alvas@ubi:~/splitfileproblem/original$ echo -e 'jkl' > c.txt
alvas@ubi:~/splitfileproblem/original$ echo -e 'mno' > d.txt
alvas@ubi:~/splitfileproblem/original$ ls
a.txt  b.txt  c.txt  d.txt

In Python:

alvas@ubi:~/splitfileproblem$ ls
original  test  train
alvas@ubi:~/splitfileproblem$ python
Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> from sklearn.model_selection import train_test_split
>>> os.listdir('original')
['b.txt', 'd.txt', 'c.txt', 'a.txt']
>>> X = y= os.listdir('original')
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
>>> X_train
['a.txt', 'd.txt', 'b.txt']
>>> X_test
['c.txt']

Now move the files:

>>> for x in X_train:
...     os.rename('original/'+x , 'train/'+x)
... 
>>> for x in X_test:
...     os.rename('original/'+x , 'test/'+x)
... 
>>> os.listdir('test')
['c.txt']
>>> os.listdir('train')
['b.txt', 'd.txt', 'a.txt']
>>> os.listdir('original')
[]

See also: How to move a file in Python
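
For comparison, the shuffle-and-slice idea from the question can also be sketched with the standard library alone. This is only a minimal sketch and not part of the scikit-learn approach above: the helper name split_files, its seed parameter and the directory arguments are hypothetical. Because the shuffled list is cut at a fixed index, the split is exact up to rounding, so no floating-point check is needed afterwards.

```python
import os
import random
import shutil


def split_files(src, train_dir, test_dir, test_ratio=0.25, seed=None):
    """Move the entries of src into train_dir and test_dir (hypothetical helper)."""
    names = sorted(os.listdir(src))       # sort first so a given seed is reproducible
    random.Random(seed).shuffle(names)    # shuffle in place with a private RNG
    n_test = int(round(len(names) * test_ratio))  # test count is fixed up front
    test, train = names[:n_test], names[n_test:]
    for target, group in ((test_dir, test), (train_dir, train)):
        os.makedirs(target, exist_ok=True)
        for name in group:
            shutil.move(os.path.join(src, name), os.path.join(target, name))
    return train, test
```

Called as split_files('original', 'train', 'test', test_ratio=0.25) on the four files from the demonstration, this should leave three entries in train and one in test.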

ghoti · Answer

Here's a simple example that uses bash's $RANDOM to move things to one of two target directories.

$ touch {1..10}
$ mkdir red blue
$ a=(*/)
$ RANDOM=$$
$ for f in [0-9]*; do mv -v "$f" "${a[$((RANDOM/(32768/${#a[@]})))]}"; done
1 -> red/1
10 -> red/10
2 -> blue/2
3 -> red/3
4 -> red/4
5 -> red/5
6 -> red/6
7 -> blue/7
8 -> blue/8
9 -> blue/9

This example starts with the creation of 10 files and two target directories. It sets an array to */ which expands to "all the directories within the current directory". It then runs a for loop with what looks like line noise in it. I'll break it apart for ya.

"${a[$((RANDOM/(32768/${#a[@]})))]}" is:

${a[ ... the array "a",
$((...)) ... whose subscript is an integer math function.
$RANDOM is a bash variable that generates a random(ish) number from 0 to 32767, and our formula divides the denominator of that ratio by:
${#a[@]}, effectively multiplying RANDOM/32768 by the number of elements in the array "a".

The result of all this is that we pick a random array element, a.k.a. a random directory.
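
The integer arithmetic in that subscript can be checked by brute force. The sketch below mirrors it in Python, assuming $RANDOM covers 0 to 32767 as described; the names pick_index and RANDOM_RANGE are made up for illustration. It also shows an edge case the walkthrough above does not mention: when the number of directories does not divide 32768 evenly, the largest values can produce an index one past the end of the array.

```python
RANDOM_RANGE = 32768  # $RANDOM yields integers 0..32767


def pick_index(r, n):
    """Mirror the bash integer arithmetic RANDOM / (32768 / n)."""
    return r // (RANDOM_RANGE // n)


# With two directories, every possible value maps to index 0 or 1.
indices = {pick_index(r, 2) for r in range(RANDOM_RANGE)}
print(sorted(indices))  # [0, 1]

# With three, 32768 // 3 == 10922, so the two largest values map to index 3,
# which is one past the end of a three-element array.
overflow = [r for r in range(RANDOM_RANGE) if pick_index(r, 3) >= 3]
print(overflow)  # [32766, 32767]
```

For the two-directory case in this answer the formula is safe; with other counts it may be worth clamping the index or using a modulo instead.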

If you really want to work from your "list of files", and assuming you leave your list of potential targets in the array "a", you could replace the for loop with a while loop:

while read f; do
  mv -v "$f" "${a[$((RANDOM/(32768/${#a[@]})))]}"
done < /dir/file.txt

Now ... these solutions split results "evenly". That's what happens when you divide the denominator by the number of targets. And because they're random, there's no way to ensure that your random numbers won't put all your files into a single directory. So to get an uneven split, you need to be more creative.

Let's assume we're dealing with only two targets (since I think that's what you're doing). If you're looking for a 25/75 split, slice up the random number range accordingly.

$ declare -a b=([0]="red/" [8192]="blue/")
$ for f in {1..10}; do n=$RANDOM; for i in "${!b[@]}"; do [ $i -gt $n ] && break; o="${b[i]}"; done; mv -v "$f" "$o"; done

Broken out for easier reading, here's what we've got, with comments:

declare -a b=([0]="red/" [8192]="blue/")

for f in {1..10}; do         # Step through our files...
  n=$RANDOM                  # Pick a random number, 0-32767
  for i in "${!b[@]}"; do    # Step through the indices of the array of targets
    [ $i -gt $n ] && break   # If the current index is greater than the random number, stop.
    o="${b[i]}"              # If we haven't stopped, name this as our target,
  done
  mv -v "$f" "$o"            # and move the file there.
done

We define our split using the indices of an array. 8192 is 25% of 32768, the number of values $RANDOM can take (0 to 32767). You could split things however you like within this range, including amongst more than 2 targets.
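
The same threshold lookup can be sketched in Python with the standard bisect module. This is an illustrative translation, not part of the bash solution: the lists thresholds and targets stand in for the indices and values of the array "b", and pick_target is a hypothetical name.

```python
import bisect

# Same layout as the bash array b: threshold -> target directory.
thresholds = [0, 8192]
targets = ["red/", "blue/"]


def pick_target(n):
    """Return the target whose threshold is the largest one not exceeding n."""
    return targets[bisect.bisect_right(thresholds, n) - 1]


print(pick_target(8191))  # red/
print(pick_target(8192))  # blue/
```

Adding a third target would only mean appending one more threshold and one more directory, as long as the thresholds stay sorted.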

If you want to test the results of this method, counting results in an array is a way to do it. Let's build a shell function to help with testing.

$ tester() { declare -A c=(); for f in {1..10000}; do n=$RANDOM; for i in "${!b[@]}"; do [ $i -gt $n ] && break; o="${b[i]}"; done; ((c[$o]++)); done; declare -p c; }
$ declare -a b='([0]="red/" [8192]="blue/")'
$ tester
declare -A c='([blue/]="7540" [red/]="2460" )'
$ b=([0]="red/" [10992]="blue/")
$ tester
declare -A c='([blue/]="6633" [red/]="3367" )'

On the first line, we define our function. The second line sets the "b" array with a 25/75 split, then we run the function, whose output is the counter array. Then we redefine the "b" array with a 33/67 split (or so), and run the function again to demonstrate results.
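
The counts from tester can be compared against the shares the thresholds imply. The sketch below computes those shares exactly, under the assumption that $RANDOM is uniform over 0 to 32767; expected_shares and RANDOM_RANGE are hypothetical names introduced for this illustration.

```python
RANDOM_RANGE = 32768  # $RANDOM yields integers 0..32767


def expected_shares(thresholds, targets):
    """Fraction of the $RANDOM range that falls to each target."""
    # Each target owns the values from its threshold up to the next one.
    bounds = list(thresholds[1:]) + [RANDOM_RANGE]
    return {t: (hi - lo) / RANDOM_RANGE
            for t, lo, hi in zip(targets, thresholds, bounds)}


print(expected_shares([0, 8192], ["red/", "blue/"]))
# {'red/': 0.25, 'blue/': 0.75}
print(expected_shares([0, 10992], ["red/", "blue/"]))
```

The second call gives roughly 0.335 for red/ and 0.665 for blue/, which is in line with the 3367 and 6633 counted over 10000 draws above.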

So... While you certainly could use python for this, you can almost certainly achieve what you need with bash and a little math.

Tags: python, bash, text-files, file-handling, train-test-split

Mønster
