
Maintaining a ratio when splitting up data in a Python function

I have some data that I want to split into smaller groups that all maintain a common ratio. I wrote a function that takes two arrays as input, calculates their size ratio, and then lists the options for how many groups I can split the data into (if all the groups are the same size). Here is the function:

def cross_validation_group(train_data, test_data):
    import numpy as np
    from calculator import factors
    test_length = len(test_data)
    train_length = len(train_data)
    total_length = test_length + train_length
    ratio = test_length/float(total_length)
    possibilities = factors(total_length)
    print possibilities
    print possibilities[len(possibilities)-1] * ratio
    super_count = 0
    for i in possibilities:
        if i < len(possibilities)/2:
            pass
        else: 
            attempt = float(i * ratio)
            if attempt.is_integer():
                print str(i) + " is an option for total size with " +  str(attempt) + " as test size and " + str(i - attempt) + " as train size! This is with " + str(total_length/i) + " folds."
            else:
                pass
    folds = int(raw_input("So how many folds would you like to use? If no possibilities were given that would be sufficient, type 0: "))
    if folds != 0:
        total_size = total_length/folds
        test_size = float(total_size * ratio)
        train_size = total_size - test_size
        columns = train_data[0]
        columns= len(columns)
        groups = np.empty((folds,(test_size + train_size),columns))
        i = 0
        a = 0
        b = 0
        for j in range (0,folds):
            test_size_new = test_size * (j + 1)
            train_size_new = train_size * j
            total_size_new = (train_size + test_size) * (j + 1)
            cut_off = total_size_new - train_size
            p = 0
            while i < total_size_new:
                if i < cut_off:
                    groups[j,p] = test_data[a]
                    a += 1
                else:
                    groups[j,p] = train_data[b]
                    b += 1
                i += 1
                p += 1
        return groups
    else:
        print "This method cannot be used because the ratio cannot be maintained with equal group sizes other than for the options you were given"

So my question is: how can I add a third input to the function, the number of folds, and change the function so that, rather than iterating to make sure each group has the same size and the right ratio, each group just has the right ratio but may vary in size?

Addition for @JamesHolderness

So your method is almost perfect, but here is one issue:

With lengths 357 and 143 and 9 folds, this is the returned list:

[(39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16)]

Now when you add up the columns, you get 351 and 144.

The 351 is fine because it is less than 357, but the 144 doesn't work because it is greater than 143! Since 357 and 143 are the lengths of the arrays, the 144th row of the second array does not exist.
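
To make the problem concrete, here is a quick sanity check (a sketch in Python 3 syntax; the variable names are only for illustration) that sums each column and compares it with the number of rows available:

```python
# The list returned for lengths 357 and 143 with 9 folds
groups = [(39, 16)] * 9
available = (357, 143)  # lengths of the two input arrays

# Total rows requested from each array across all folds
first_total = sum(g[0] for g in groups)
second_total = sum(g[1] for g in groups)

print(first_total, first_total <= available[0])    # 351 True
print(second_total, second_total <= available[1])  # 144 False
```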

Ryan Saxe asked Apr 18 '13 22:04


1 Answer

Here's an algorithm that I think might work for you.

You take the test_length and train_length and divide each by their GCD to get the ratio as a simple fraction. You then add the numerator and denominator together, and that sum is the size factor for your groups.

For example if the ratio is 3:2, the size of each group must be a multiple of 5.
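
As a minimal sketch of that first step (written in Python 3 syntax, which is an assumption since the question uses Python 2; the function name `group_size_factor` is hypothetical), the reduced ratio and the size factor could be computed like this:

```python
from math import gcd  # Python 3.5+; Python 2 would use fractions.gcd instead


def group_size_factor(test_length, train_length):
    """Return (test_multiple, train_multiple, total_multiple) for the reduced ratio."""
    divisor = gcd(test_length, train_length)
    test_multiple = test_length // divisor
    train_multiple = train_length // divisor
    # Every group size must be a multiple of the sum of the two parts
    return test_multiple, train_multiple, test_multiple + train_multiple


print(group_size_factor(300, 200))  # (3, 2, 5)
```

Note that when the two lengths share no common factor (as with 143 and 357), the size factor is the whole data set, which is why the ratio sometimes has to be approximated.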

You then take the total_length and divide it by the number of folds to get the ideal size for the first group, which may well be a floating point number. You find the largest multiple of 5 that is less than or equal to that, and that is your first group.

Subtract that value from your total, and divide by folds-1 to get the ideal size for the next group. Again find the largest multiple of 5, subtract that from the total, and continue until you have calculated all the groups.
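
The sizing loop described above can be sketched on its own as follows. This is only an illustration in Python 3 syntax; the name `greedy_group_sizes` and its parameters are hypothetical, and it uses integer arithmetic, which gives the same result as flooring the floating point ideal size.

```python
def greedy_group_sizes(total_length, folds, multiple):
    """Pick a size for each group: the largest multiple of `multiple`
    that does not exceed the ideal share of what is still unassigned."""
    sizes = []
    remaining = total_length
    for folds_left in range(folds, 0, -1):
        # Ideal share is remaining / folds_left; round it down to a multiple
        size = (remaining // (folds_left * multiple)) * multiple
        sizes.append(size)
        remaining -= size
    return sizes


print(greedy_group_sizes(52, 3, 5))  # [15, 15, 20]
```

In this example 2 of the 52 items are left unused, because 52 is not itself a multiple of 5.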

Some example code:

from fractions import gcd  # Python 2; on Python 3 use: from math import gcd

total_length = test_length + train_length
divisor = gcd(test_length, train_length)
# Use // so the divisions stay integer divisions on Python 3 as well
test_multiple = test_length // divisor
train_multiple = train_length // divisor
total_multiple = test_multiple + train_multiple

# Adjust the ratio if there isn't enough data for the requested folds
if total_length // total_multiple < folds:
    total_multiple = total_length // folds
    test_multiple = int(round(float(test_length) * total_multiple / total_length))
    train_multiple = total_multiple - test_multiple

groups = []
for i in range(folds, 0, -1):
    float_size = float(total_length) / i  # ideal size for this group
    # Largest multiple of total_multiple that fits in the ideal size
    int_size = int(float_size / total_multiple) * total_multiple
    test_size = int_size * test_multiple // total_multiple
    train_size = int_size * train_multiple // total_multiple
    test_length -= test_size    # keep track of the test data used
    train_length -= train_size  # keep track of the train data used
    total_length -= int_size
    groups.append((test_size, train_size))

# If the test_length or train_length are negative, we need to adjust the groups
# to "give back" some of the data.
distribute_overrun(groups,test_length,0)
distribute_overrun(groups,train_length,1)

This has been updated to keep track of how much data has been used from each set (test and train), without worrying if we use too much initially.

Then at the end, if there's any overrun (i.e. test_length or train_length have gone negative), we distribute that overrun back into the groups by decrementing the appropriate side of the ratio in as many items as it takes to bring the overrun back to zero.

The distribute_overrun function is included below.

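def distribute_overrun(groups, overrun, part):
    # Take one item back from successive groups until the overrun is repaid
    i = 0
    while overrun < 0:
        group = list(groups[i])
        group[part] -= 1
        groups[i] = tuple(group)
        overrun += 1
        i = (i + 1) % len(groups)  # wrap around if the overrun exceeds the number of groups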

At the end of that, groups will be a list of tuples containing the test_size and train_size for each group.
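
For reference, here is one way the pieces above might be combined into a single function. This is a sketch in Python 3 syntax rather than the exact code above: the wrapper name `fold_sizes` is hypothetical, and the guard against decrementing an empty group and the wrap-around index are my own assumptions.

```python
from math import gcd  # Python 3.5+


def distribute_overrun(groups, overrun, part):
    """Give back data by decrementing one side of successive groups."""
    i = 0
    while overrun < 0:
        group = list(groups[i])
        if group[part] > 0:  # assumption: never push a size below zero
            group[part] -= 1
            groups[i] = tuple(group)
            overrun += 1
        i = (i + 1) % len(groups)  # assumption: wrap around if needed


def fold_sizes(test_length, train_length, folds):
    """Return a list of (test_size, train_size) tuples, one per fold."""
    total_length = test_length + train_length
    divisor = gcd(test_length, train_length)
    test_multiple = test_length // divisor
    train_multiple = train_length // divisor
    total_multiple = test_multiple + train_multiple

    # Approximate the ratio if there isn't enough data for the folds
    if total_length // total_multiple < folds:
        total_multiple = total_length // folds
        test_multiple = int(round(float(test_length) * total_multiple / total_length))
        train_multiple = total_multiple - test_multiple

    groups = []
    for i in range(folds, 0, -1):
        # Largest multiple of total_multiple within the ideal group size
        int_size = (total_length // (i * total_multiple)) * total_multiple
        test_size = int_size * test_multiple // total_multiple
        train_size = int_size * train_multiple // total_multiple
        test_length -= test_size
        train_length -= train_size
        total_length -= int_size
        groups.append((test_size, train_size))

    # Negative leftovers mean too much data was handed out; give it back
    distribute_overrun(groups, test_length, 0)
    distribute_overrun(groups, train_length, 1)
    return groups


# The case from the question: lengths 357 and 143 with 9 folds
groups = fold_sizes(357, 143, 9)
print(groups[0], groups[1])        # (39, 15) (39, 16)
print(sum(g[0] for g in groups))   # 351
print(sum(g[1] for g in groups))   # 143
```

With the overrun handed back, the second column now sums to exactly 143, at the cost of the first group having a slightly different ratio from the rest.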

If that sounds like the sort of thing you want, but you need me to expand on the code example, just let me know.

James Holderness answered Nov 15 '22 14:11