I am at the moment trying make a setup script, capable of setting up a workspace up for me, such that I don't need to do it manually. I started doing this in bash, but quickly realized that would not work that well.
My next idea was to do it using python, but can't seem to do it a proper way.. My idea was to make a list (a list being a .txt files with the paths for all the datafiles), shuffle this list, and then move each file to either my train dir or test dir, given the ratio....
But this is python, isn't there a more simpler way to do it, it seems like I am doing an unessesary workaround just to split the files.
Bash Code:
# Partition data randomly into train and test.
cd ${PATH_TO_DATASET}
SPLIT=0.5 #train/test split
NUMBER_OF_FILES=$(ls ${PATH_TO_DATASET} | wc -l) ## number of directories in the dataset
even=1
echo ${NUMBER_OF_FILES}
if [ `echo "${NUMBER_OF_FILES} % 2" | bc` -eq 0 ]
then
even=1
echo "Even is true"
else
even=0
echo "Even is false"
fi
echo -e "${BLUE}Seperating files in to train and test set!${NC}"
for ((i=1; i<=${NUMBER_OF_FILES}; i++))
do
ran=$(python -c "import random;print(random.uniform(0.0, 1.0))")
if [[ ${ran} < ${SPLIT} ]]
then
##echo "test ${ran}"
cp -R $(ls -d */|sed "${i}q;d") ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data/test/
else
##echo "train ${ran}"
cp -R $(ls -d */|sed "${i}q;d") ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data/train/
fi
##echo $(ls -d */|sed "${i}q;d")
done
cd ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data
NUMBER_TRAIN_FILES=$(ls train/ | wc -l)
NUMBER_TEST_FILES=$(ls test/ | wc -l)
echo "${NUMBER_TRAIN_FILES} and ${NUMBER_TEST_FILES}..."
echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
if [[ ${even} = 1 ]] && [[ ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES} != ${SPLIT} ]]
then
echo "Something need to be fixed!"
if [[ $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES}) > ${SPLIT} ]]
then
echo "Too many files in the TRAIN set move some to TEST"
cd train
echo $(pwd)
while [[ ${NUMBER_TRAIN_FILES}/${NUMBER_TEST_FILES} != ${SPLIT} ]]
do
mv $(ls -d */|sed "1q;d") ../test/
echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
done
else
echo "Too many files in the TEST set move some to TRAIN"
cd test
while [[ ${NUMBER_TRAIN_FILES}/${NUMBER_TEST_FILES} != ${SPLIT} ]]
do
mv $(ls -d */|sed "1q;d") ../train/
echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
done
fi
fi
My problem were the last part. Since i picking the numbers by random, I would not be sure that the data would be partitioned as hoped, which my last if statement were to check whether the partition was done right, and if not then fix it.. This was not possible since i am checking floating points, and the solution in general became more like a quick fix.
scikit-learn
comes to the rescue =)
>>> import numpy as np
>>> from sklearn.cross_validation import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
>>> y
[0, 1, 2, 3, 4]
# If i want 1/4 of the data for testing
# and i set a random seed of 42.
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
>>> X_train
array([[4, 5],
[0, 1],
[6, 7]])
>>> X_test
array([[2, 3],
[8, 9]])
>>> y_train
[2, 0, 3]
>>> y_test
[1, 4]
See http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
To demonstrate:
alvas@ubi:~$ mkdir splitfileproblem
alvas@ubi:~$ cd splitfileproblem/
alvas@ubi:~/splitfileproblem$ mkdir original
alvas@ubi:~/splitfileproblem$ mkdir train
alvas@ubi:~/splitfileproblem$ mkdir test
alvas@ubi:~/splitfileproblem$ ls
original train test
alvas@ubi:~/splitfileproblem$ cd original/
alvas@ubi:~/splitfileproblem/original$ ls
alvas@ubi:~/splitfileproblem/original$ echo 'abc' > a.txt
alvas@ubi:~/splitfileproblem/original$ echo 'def\nghi' > b.txt
alvas@ubi:~/splitfileproblem/original$ cat a.txt
abc
alvas@ubi:~/splitfileproblem/original$ echo -e 'def\nghi' > b.txt
alvas@ubi:~/splitfileproblem/original$ cat b.txt
def
ghi
alvas@ubi:~/splitfileproblem/original$ echo -e 'jkl' > c.txt
alvas@ubi:~/splitfileproblem/original$ echo -e 'mno' > d.txt
alvas@ubi:~/splitfileproblem/original$ ls
a.txt b.txt c.txt d.txt
In Python:
alvas@ubi:~/splitfileproblem$ ls
original test train
alvas@ubi:~/splitfileproblem$ python
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> from sklearn.cross_validation import train_test_split
>>> os.listdir('original')
['b.txt', 'd.txt', 'c.txt', 'a.txt']
>>> X = y= os.listdir('original')
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
>>> X_train
['a.txt', 'd.txt', 'b.txt']
>>> X_test
['c.txt']
Now move the files:
>>> for x in X_train:
... os.rename('original/'+x , 'train/'+x)
...
>>> for x in X_test:
... os.rename('original/'+x , 'test/'+x)
...
>>> os.listdir('test')
['c.txt']
>>> os.listdir('train')
['b.txt', 'd.txt', 'a.txt']
>>> os.listdir('original')
[]
See also: How to move a file in Python
Here's a simple example that uses bash's $RANDOM
to move things to one of two target directories.
$ touch {1..10}
$ mkdir red blue
$ a=(*/)
$ RANDOM=$$
$ for f in [0-9]*; do mv -v "$f" "${a[$((RANDOM/(32768/${#a[@]})))]}"; done
1 -> red/1
10 -> red/10
2 -> blue/2
3 -> red/3
4 -> red/4
5 -> red/5
6 -> red/6
7 -> blue/7
8 -> blue/8
9 -> blue/9
This example starts with the creation of 10 files and two target directories. It sets an array to */
which expands to "all the directories within the current directory". It then runs a for loop with what looks like line noise in it. I'll break it apart for ya.
"${a[$((RANDOM/(32768/${#a[@]})+1))]}"
is:
${a[
... the array "a",$((...))
... whose subscript is an integer math function.$RANDOM
is a bash variable that generates a ramdom(ish) number from 0 to 32767, and our formula divides the denominator of that ratio by:${#a[@]}
, effectively multiplying RANDOM/32768
by the number of elements in the array "a".The result of all this is that we pick a random array element, a.k.a. a random directory.
If you really want to work from your "list of files", and assuming you leave your list of potential targets in the array "a", you could replace the for loop with a while loop:
while read f; do
mv -v "$f" "${a[$((RANDOM/(32768/${#a[@]})))]}"
done < /dir/file.txt
Now ... these solutions split results "evenly". That's what happens when you multiply the denominator. And because they're random, there's no way to insure that your random numbers won't put all your files into a single directory. So to get a split, you need to be more creative.
Let's assume we're dealing with only two targets (since I think that's what you're doing). If you're looking for a 25/75 split, slice up the random number range accordingly.
$ declare -a b=([0]="red/" [8192]="blue/")
$ for f in {1..10}; do n=$RANDOM; for i in "${!b[@]}"; do [ $i -gt $n ] && break; o="${b[i]}"; done; mv -v "$f" "$o"; done
Broken out for easier reading, here's what we've got, with comments:
declare -a b=([0]="red/" [8192]="blue/")
for f in {1..10}; do # Step through our files...
n=$RANDOM # Pick a random number, 0-32767
for i in "${!b[@]}"; do # Step through the indices of the array of targets
[ $i -gt $n ] && break # If the current index is > than the random number, stop.
o="${b[i]}" # If we haven't stopped, name this as our target,
done
mv -v "$f" "$o" # and move the file there.
done
We define our split using the index of an array. 8192 is 25% of 32767, the max value of $RANDOM. You could split things however you like within this range, including amongst more than 2.
If you want to test the results of this method, counting results in an array is a way to do it. Let's build a shell function to help with testing.
$ tester() { declare -A c=(); for f in {1..10000}; do n=$RANDOM; for i in "${!b[@]}"; do [ $i -gt $n ] && break; o="${b[i]}"; done; ((c[$o]++)); done; declare -p c; }
$ declare -a b='([0]="red/" [8192]="blue/")'
$ tester
declare -A c='([blue/]="7540" [red/]="2460" )'
$ b=([0]="red/" [10992]="blue/")
$ tester
declare -A c='([blue/]="6633" [red/]="3367" )'
On the first line, we define our function. Second line sets the "b" array with a 25/75 split, then we run the function, whose output is the the counter array. Then we redefine the "b" array with a 33/67 split (or so), and run the function again to demonstrate results.
So... While you certainly could use python for this, you can almost certainly achieve what you need with bash and a little math.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With