I'm trying to split a large data frame of cycle data into smaller data frames of equal, or near-equal, cycle length. array_split was working great until my data would not split evenly (it worked fine with 500,000 cycles, but not with 1,190,508). I want the sections to be in 1,000-cycle increments (except the last frame, which would be shorter).
Here's the scenario:
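import math
import numpy as np
import pandas as pd

# three columns of random data, one row per cycle
d = {
    'a': pd.Series(np.random.random(1190508)),
    'b': pd.Series(np.random.random(1190508)),
    'c': pd.Series(np.random.random(1190508)),
}
frame = pd.DataFrame(d)
cycles = 1000
# number of sections needed to cover every row
sections = math.ceil(len(frame)/cycles)
split_frames = np.array_split(frame, sections)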
The docs show array_split splitting into even groups while it can, then making a smaller group at the end because the data can't be divided evenly. This is what I want, but currently, if I look at the lengths of each frame in this new split_frames list:
split_len = pd.DataFrame([len(a) for a in split_frames])
split_len.to_csv('lengths.csv')
the first 699 frames (indices 0 to 698) are 1000 elements long, but the rest (frames 699 to 1190) are 999 elements long.
It seems to make this seemingly random break in length no matter what number I pass for sections (rounded, even, or whatever else).
I'm struggling to understand why it's not creating equal frame lengths except the last one like in the docs:
>>> x = np.arange(8.0)
>>> np.array_split(x, 3)
[array([ 0., 1., 2.]), array([ 3., 4., 5.]), array([ 6., 7.])]
Any help is appreciated, thanks!
array_split doesn't make a number of equal sections and one with the leftovers. If you split an array of length l into n sections, it makes l % n sections of size l//n + 1 and the rest of size l//n. See the source for more details. (This really ought to be explained in the docs.)
Update: as of NumPy 1.14, this is now explained in the docs.
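To make the rule concrete, here is a minimal sketch that predicts the block sizes from the formula above and compares them with what np.array_split actually returns for the numbers in the question. The helper name array_split_sizes is made up for illustration; it is not part of NumPy.

```python
import numpy as np

def array_split_sizes(length, n):
    """Sizes predicted by the rule: length % n blocks of size
    length // n + 1, followed by the rest of size length // n.
    (Hypothetical helper, not a NumPy function.)"""
    base, extra = divmod(length, n)
    return [base + 1] * extra + [base] * (n - extra)

# Numbers from the question: 1,190,508 rows split into 1191 sections.
predicted = array_split_sizes(1190508, 1191)
actual = [len(part) for part in np.array_split(np.arange(1190508), 1191)]

print(predicted == actual)     # compare the rule against NumPy
print(predicted.count(1000))   # number of 1000-element blocks
print(predicted.count(999))    # number of 999-element blocks
```

Since 1190508 // 1191 is 999 and 1190508 % 1191 is 699, the rule gives 699 blocks of 1000 followed by 492 blocks of 999, which matches the break the question describes.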
As @user2357112 writes, array_split doesn't do what you think it does... but by looking at the docs, it's hard to know what it does, anyways. In fact, I'd say that its behavior is undefined. We expect it to return something, but we don't know what properties that something will have.
To get what you want, I'd use numpy.split's ability to provide custom indices. So, for example:
import numpy as np

def greedy_split(arr, n, axis=0):
    """Greedily splits an array into n blocks.

    Splits array arr along axis into n blocks such that:
    - blocks 1 through n-1 are all the same size
    - the sum of all block sizes is equal to arr.shape[axis]
    - the last block is nonempty, and not bigger than the other blocks

    Intuitively, this "greedily" splits the array along the axis by making
    the first blocks as big as possible, then putting the leftovers in the
    last block.
    """
    length = arr.shape[axis]
    # compute the size of each of the first n-1 blocks; cast to int so the
    # split indices below are integers, which np.split requires
    block_size = int(np.ceil(length / float(n)))
    # the indices at which the splits will occur
    ix = np.arange(block_size, length, block_size)
    return np.split(arr, ix, axis)
Some examples:
>>> x = np.arange(10)
>>> greedy_split(x, 2)
[array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])]
>>> greedy_split(x, 3)
[array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([8, 9])]
>>> greedy_split(x, 4)
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]), array([9])]
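For the goal stated in the question (fixed 1,000-row frames with a shorter one at the end), slicing the DataFrame by row position is another option that avoids computing split indices. This is only a sketch under assumptions: the helper name split_by_rows is invented here, and a single-column frame stands in for the real data.

```python
import numpy as np
import pandas as pd

def split_by_rows(frame, rows_per_chunk):
    """Return consecutive slices of at most rows_per_chunk rows each.
    (Hypothetical helper; only the last slice can be shorter.)"""
    return [
        frame.iloc[start:start + rows_per_chunk]
        for start in range(0, len(frame), rows_per_chunk)
    ]

# Stand-in for the frame in the question, with one column to keep it light.
frame = pd.DataFrame({'a': np.random.random(1190508)})
chunks = split_by_rows(frame, 1000)
lengths = [len(chunk) for chunk in chunks]

print(len(chunks))    # number of frames produced
print(lengths[-1])    # size of the final, shorter frame
```

With 1,190,508 rows this gives 1,190 frames of 1,000 rows and a final frame of 508 rows.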