I am new to Python and have trouble understanding the following code:
data_a, data_b, data_C = np.split(original_data.sample(frac=1, random_state=1729),
[int(0.7 * len(original_data)), int(0.9*len(original_data))])
My original data set has a total of 38000 rows. After this split, data_a has 26600 rows, data_b has 7600 rows, and data_C has 3800 rows. I understand that 70% of original_data is 26600 rows. But why does data_b have 7600 rows and data_C 3800? I read the documentation for the split method, and from my reading of the code I expected 90% of the remaining 30% of my initial 38000 rows to go into data_b, which would be 10260 rows, not 7600.
You have to do it sequentially if you want to split the remaining 30% into 90/10. Try this:
data_a, remaining_data = np.split(original_data.sample(frac=1, random_state=1729),
[int(0.7 * len(original_data))])
data_b, data_C = np.split(remaining_data,[int(0.9 * len(remaining_data))])
data_a.shape, data_b.shape, data_C.shape
Output:
((26600,), (10260,), (1140,))
The split percentages there are relative to the original dataset, so if you want data_b to be 90% of the 30% left after the first split, the second split point has to sit at 0.7 + 0.9 * 0.3 = 0.97 of the original length, something like this:
data_a, data_b, data_C = np.split(original_data.sample(frac=1, random_state=1729), [int(0.7 * len(original_data)), int(0.97*len(original_data))])
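To make the 0.97 less of a magic number, here is a small sketch that turns the desired share of each part into cumulative cut indices. The helper name split_points is made up for illustration and is not part of NumPy or pandas; it uses round() rather than int() to sidestep floating-point truncation.

```python
from itertools import accumulate


def split_points(n_rows, fractions):
    """Turn per-part fractions of the whole dataset into cumulative cut indices.

    Hypothetical helper. The last part is implied: it gets the remaining rows.
    """
    return [round(c * n_rows) for c in accumulate(fractions)]


n_rows = 38000
# 70% for the first part, then 90% of the remaining 30% (27%) for the second.
points = split_points(n_rows, [0.7, 0.9 * 0.3])

# Size of each part is the distance between consecutive cut points.
sizes = [b - a for a, b in zip([0] + points, points + [n_rows])]

print(points)
print(sizes)
```

The resulting list can be passed as the second argument to np.split in place of the hand-written indices.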
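That is because you specify the split points rather than the ratios of the resulting data sets. The indices mark where to cut, so int(0.9 * len(original_data)) cuts at row 34200, which leaves 34200 - 26600 = 7600 rows for data_b and 38000 - 34200 = 3800 rows for data_C.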
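A minimal sketch of that behaviour on plain NumPy arrays (the array values and variable names are only for illustration; the original data is a pandas object):

```python
import numpy as np

# Ten elements, cut before position 7 and before position 9.
values = np.arange(10)
first, second, third = np.split(values, [7, 9])
print(len(first), len(second), len(third))

# Same idea with the row count from the question: the indices are
# absolute positions in the original array, not shares of the remainder.
rows = np.arange(38000)
n = len(rows)
part_a, part_b, part_c = np.split(rows, [int(0.7 * n), int(0.9 * n)])
print(len(part_a), len(part_b), len(part_c))
```

Each piece runs from one cut index up to (but not including) the next, so its length is the difference between neighbouring indices.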