I have some data that I want to split into smaller groups that all maintain a common ratio. I wrote a function that takes two arrays as input, calculates their size ratio, and then tells me how many groups I can split the data into (if all the groups are the same size). Here is the function:
def cross_validation_group(train_data, test_data):
    import numpy as np
    from calculator import factors
    test_length = len(test_data)
    train_length = len(train_data)
    total_length = test_length + train_length
    ratio = test_length/float(total_length)
    possibilities = factors(total_length)
    print possibilities
    print possibilities[len(possibilities)-1] * ratio
    super_count = 0
    for i in possibilities:
        if i < len(possibilities)/2:
            pass
        else:
            attempt = float(i * ratio)
            if attempt.is_integer():
                print str(i) + " is an option for total size with " + str(attempt) + " as test size and " + str(i - attempt) + " as train size! This is with " + str(total_length/i) + " folds."
            else:
                pass
    folds = int(raw_input("So how many folds would you like to use? If no possibilities were given that would be sufficient, type 0: "))
    if folds != 0:
        total_size = total_length/folds
        test_size = float(total_size * ratio)
        train_size = total_size - test_size
        columns = train_data[0]
        columns = len(columns)
        groups = np.empty((folds,(test_size + train_size),columns))
        i = 0
        a = 0
        b = 0
        for j in range (0,folds):
            test_size_new = test_size * (j + 1)
            train_size_new = train_size * j
            total_size_new = (train_size + test_size) * (j + 1)
            cut_off = total_size_new - train_size
            p = 0
            while i < total_size_new:
                if i < cut_off:
                    groups[j,p] = test_data[a]
                    a += 1
                else:
                    groups[j,p] = train_data[b]
                    b += 1
                i += 1
                p += 1
        return groups
    else:
        print "This method cannot be used because the ratio cannot be maintained with equal group sizes other than for the options you were given"
So my question is: how can I add a third input to the function, the number of folds, and change the function so that, rather than iterating to make sure each group has the same size with the right ratio, each group just has the right ratio but the group sizes may vary?
Addition for @JamesHolderness
Your method is almost perfect, but there is one issue.
With lengths 357 and 143 and 9 folds, this is the returned list:
[(39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16)]
Now, when you add up the columns, you get 351 and 144.
The 351 is fine because it is less than 357, but the 144 doesn't work because it is greater than 143! Since 357 and 143 are the lengths of the arrays, the 144th row of the second array does not exist...
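For reference, the column totals above can be reproduced with a short check. This is only a sketch in Python 3 syntax, and the names `groups`, `lengths` and `totals` are chosen here for illustration:

```python
groups = [(39, 16)] * 9   # the list returned for 9 folds
lengths = (357, 143)      # rows available in each array

# Sum each column of the returned list and compare with the available rows.
totals = tuple(sum(column) for column in zip(*groups))
print(totals)  # (351, 144)

for used, available in zip(totals, lengths):
    print(used, "of", available, "OK" if used <= available else "too many")
```

The second column asks for one more row than the array holds.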
Here's an algorithm that I think might work for you.
Take test_length and train_length and divide each by their GCD to get the ratio in lowest terms. Then add the numerator and denominator together; that sum is the size factor for your groups.
For example if the ratio is 3:2, the size of each group must be a multiple of 5.
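A minimal sketch of that step, assuming Python 3 (where `math.gcd` is available). The function name `ratio_unit` and the lengths 30 and 20 are made up for illustration:

```python
from math import gcd

def ratio_unit(test_length, train_length):
    """Reduce test:train to lowest terms and return (test_part, train_part, unit)."""
    divisor = gcd(test_length, train_length)
    test_part = test_length // divisor
    train_part = train_length // divisor
    # Every group size must be a multiple of this sum to keep the exact ratio.
    return test_part, train_part, test_part + train_part

print(ratio_unit(30, 20))  # (3, 2, 5)
```

When the two lengths share no common factor (as with 357 and 143), the unit is the whole data set, which is why the ratio sometimes has to be approximated.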
You then take the total_length and divide it by the number of folds to get the ideal size for the first group, which may well be a floating point number. You find the largest multiple of 5 that is less than or equal to that, and that is your first group.
Subtract that value from your total, and divide by folds-1 to get the ideal size for the next group. Again find the largest multiple of 5, subtract that from the total, and continue until you have calculated all the groups.
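The size calculation described above might be sketched like this, in Python 3. The name `group_sizes` and the sample numbers are assumptions for illustration, and the floor is done with integer arithmetic rather than floats:

```python
def group_sizes(total_length, folds, unit):
    """Greedily size each group as the largest multiple of `unit`
    that does not exceed remaining / folds_left."""
    sizes = []
    remaining = total_length
    for folds_left in range(folds, 0, -1):
        # floor(remaining / folds_left / unit) * unit, using integers only
        size = remaining // (folds_left * unit) * unit
        sizes.append(size)
        remaining -= size
    return sizes

print(group_sizes(50, 3, 5))  # [15, 15, 20]
```

Here 50/3 is about 16.7, so the first group is 15; then 35/2 = 17.5 gives 15 again; the last group takes the remaining 20.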
Some example code:
from fractions import gcd  # Python 2; on Python 3 use math.gcd (and // for integer division)

total_length = test_length + train_length
divisor = gcd(test_length,train_length)
test_multiple = test_length/divisor
train_multiple = train_length/divisor
total_multiple = test_multiple + train_multiple

# Adjust the ratio if there isn't enough data for the requested folds
if total_length/total_multiple < folds:
    total_multiple = total_length/folds
    test_multiple = int(round(float(test_length)*total_multiple/total_length))
    train_multiple = total_multiple - test_multiple

groups = []
for i in range(folds,0,-1):
    float_size = float(total_length)/i
    int_size = int(float_size/total_multiple)*total_multiple
    test_size = int_size*test_multiple/total_multiple
    train_size = int_size*train_multiple/total_multiple
    test_length -= test_size    # keep track of the test data used
    train_length -= train_size  # keep track of the train data used
    total_length -= int_size
    groups.append((test_size,train_size))

# If the test_length or train_length are negative, we need to adjust the groups
# to "give back" some of the data.
distribute_overrun(groups,test_length,0)
distribute_overrun(groups,train_length,1)
This has been updated to keep track of the size used from each group (test and train) but not worry if we use too much initially.
Then at the end, if there's any overrun (i.e. test_length or train_length have gone negative), we distribute that overrun back into the groups by decrementing the appropriate side of the ratio in as many items as it takes to bring the overrun back to zero.
The distribute_overrun function is included below.
def distribute_overrun(groups,overrun,part):
    i = 0
    while overrun < 0:
        group = list(groups[i])
        group[part] -= 1
        groups[i] = tuple(group)
        overrun += 1
        i += 1
At the end of that, groups will be a list of tuples containing the test_size and train_size for each group.
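For anyone on Python 3, the pieces above might be assembled roughly as follows. This is a sketch, not a tested drop-in: the wrapper name `fold_sizes` is made up here, `math.gcd` and `//` stand in for the Python 2 `gcd` and `/`, and it assumes `folds` is no larger than the total number of rows.

```python
from math import gcd

def distribute_overrun(groups, overrun, part):
    """Take one item back from successive groups until the overrun is cleared."""
    i = 0
    while overrun < 0:
        group = list(groups[i])
        group[part] -= 1
        groups[i] = tuple(group)
        overrun += 1
        i += 1

def fold_sizes(test_length, train_length, folds):
    """Return a list of (test_size, train_size) tuples, one per fold."""
    total_length = test_length + train_length
    divisor = gcd(test_length, train_length)
    test_multiple = test_length // divisor
    train_multiple = train_length // divisor
    total_multiple = test_multiple + train_multiple

    # Approximate the ratio if there isn't enough data for the requested folds.
    if total_length // total_multiple < folds:
        total_multiple = total_length // folds
        test_multiple = int(round(test_length * total_multiple / total_length))
        train_multiple = total_multiple - test_multiple

    groups = []
    for i in range(folds, 0, -1):
        # Largest multiple of total_multiple not exceeding total_length / i.
        int_size = total_length // (i * total_multiple) * total_multiple
        test_size = int_size * test_multiple // total_multiple
        train_size = int_size * train_multiple // total_multiple
        test_length -= test_size    # remaining test rows (may go negative)
        train_length -= train_size  # remaining train rows (may go negative)
        total_length -= int_size
        groups.append((test_size, train_size))

    # Give back any rows that were over-allocated.
    distribute_overrun(groups, test_length, 0)
    distribute_overrun(groups, train_length, 1)
    return groups

sizes = fold_sizes(357, 143, 9)
print(sizes[0], sizes[1])  # (39, 15) (39, 16)
print(sum(t for t, _ in sizes), sum(r for _, r in sizes))  # 351 143
```

With lengths 357 and 143 and 9 folds, the first group gives one row back, so neither column asks for more rows than exist.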
If that sounds like the sort of thing you want, but you need me to expand on the code example, just let me know.
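One possible way to use such a list of sizes is to cut consecutive slices from each data set. The sketch below is Python 3, and the helper name `split_into_folds` and the sample data are hypothetical. It assumes the data supports slicing, which lists and NumPy arrays do:

```python
def split_into_folds(test_data, train_data, sizes):
    """Cut consecutive slices of each data set according to
    a list of (test_size, train_size) pairs."""
    folds = []
    test_start = train_start = 0
    for test_size, train_size in sizes:
        folds.append((test_data[test_start:test_start + test_size],
                      train_data[train_start:train_start + train_size]))
        test_start += test_size
        train_start += train_size
    return folds

test_data = list(range(6))          # 6 made-up test rows
train_data = list(range(100, 104))  # 4 made-up train rows
for test_part, train_part in split_into_folds(test_data, train_data, [(3, 2), (3, 2)]):
    print(test_part, train_part)
# [0, 1, 2] [100, 101]
# [3, 4, 5] [102, 103]
```

Any rows left over after the last group are simply not used.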