
Split numpy array into chunks by maximum size

Tags:

python

numpy

I have some very large two-dimensional numpy arrays. One data set is 55732 by 257659, which is over 14 billion elements. Because some operations I need to perform throw MemoryErrors, I would like to try splitting the array up into chunks of a certain size and running them against the chunks. (I can aggregate the results after the operation runs on each piece.) The fact that my problem is MemoryErrors means that it's important that I can cap the size of the arrays somehow, rather than split them into a constant number of pieces.

For an example, let's generate a 1009 by 1009 random array:

a = numpy.random.choice([1,2,3,4], (1009,1009))

My data isn't necessarily possible to split evenly, and it's definitely not guaranteed to be splittable by the size I want. So I've chosen 1009 because it's prime.

Let's also say I want them in chunks of no larger than 50 by 50. Since this is just to avoid errors with extremely large arrays, it's okay if the result isn't exact.

How can I split this into the desired chunks?

I'm using Python 3.6 64-bit with numpy 1.14.3 (latest).

Related

I have seen this function that uses reshape, but it doesn't work if the number of rows and columns doesn't divide the size exactly.

This question (among other similar ones) has answers explaining how to split into a certain number of chunks, but this does not explain how to split into a certain size.

I also saw this question, since it's actually my exact problem. The answer and comments suggest switching to 64-bit (which I already have) and using numpy.memmap. Neither helped.

Asked May 12 '18 by jpmc26



1 Answer

This can be done so that the resulting arrays have shapes slightly less than the desired maximum or so that they have exactly the desired maximum except for some remainder at the end.

The basic logic is to compute the parameters for splitting the array and then use array_split to split the array along each axis (or dimension) of the array.

We'll need the numpy and math modules and the example array:

import math
import numpy

a = numpy.random.choice([1,2,3,4], (1009,1009))

Slightly less than max

The logic

First store the shape of the final chunk size along each dimension that you want to split it into in a tuple:

chunk_shape = (50, 50)

array_split only splits along one axis (or dimension) of an array at a time. So let's start with just the first axis.

  1. Compute the number of sections we need to split the array into:

    num_sections = math.ceil(a.shape[0] / chunk_shape[0])
    

    In our example case, this is 21 (1009 / 50 = 20.18).

  2. Now split it:

    first_split = numpy.array_split(a, num_sections, axis=0)
    

    This gives us a list of 21 (the number of requested sections) numpy arrays that are split so they are no larger than 50 in the first dimension:

    print(len(first_split))
    # 21
    print({i.shape for i in first_split})
    # {(48, 1009), (49, 1009)}
    # These are the distinct shapes, so we don't see all 21 separately
    

    In this case, they're 48 and 49 along that axis.

  3. We can do the same thing to each new array for the second dimension:

    num_sections = math.ceil(a.shape[1] / chunk_shape[1])
    second_split = [numpy.array_split(a2, num_sections, axis=1) for a2 in first_split]
    

    This gives us a list of lists. Each sublist contains numpy arrays of the size we wanted:

    print(len(second_split))
    # 21
    print({len(i) for i in second_split})
    # {21}
    # All sublists are 21 long
    print({i2.shape for i in second_split for i2 in i})
    # {(48, 49), (49, 48), (48, 48), (49, 49)}
    # Distinct shapes
    

The full function

We can implement this for arbitrary dimensions using a recursive function:

def split_to_approx_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')

    if start_axis == len(a.shape):
        return a

    num_sections = math.ceil(a.shape[start_axis] / chunk_shape[start_axis])
    split = numpy.array_split(a, num_sections, axis=start_axis)
    return [split_to_approx_shape(split_a, chunk_shape, start_axis + 1) for split_a in split]

We call it like so:

full_split = split_to_approx_shape(a, (50,50))
print({i2.shape for i in full_split for i2 in i})
# {(48, 49), (49, 48), (48, 48), (49, 49)}
# Distinct shapes
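Since the original goal was to run an operation on each chunk and aggregate the results afterward, here is a minimal sketch of that workflow using a per-chunk sum (the aggregation step is my illustration, not part of the question):

```python
import math
import numpy

def split_to_approx_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')
    if start_axis == len(a.shape):
        return a
    num_sections = math.ceil(a.shape[start_axis] / chunk_shape[start_axis])
    split = numpy.array_split(a, num_sections, axis=start_axis)
    return [split_to_approx_shape(s, chunk_shape, start_axis + 1) for s in split]

a = numpy.random.choice([1, 2, 3, 4], (1009, 1009))
full_split = split_to_approx_shape(a, (50, 50))

# Run the operation on each (small) chunk, then aggregate the per-chunk
# results. Summing is just a stand-in for whatever memory-heavy operation
# was failing on the full array.
chunk_sums = [chunk.sum() for row in full_split for chunk in row]
total = sum(chunk_sums)
assert total == a.sum()
```

Because every chunk is at most 50 by 50, the operation itself only ever touches a small array at a time.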

Exact shapes plus remainder

The logic

If we want to be a little fancier and have all the new arrays be exactly the specified size except for a trailing leftover array, we can do that by passing a list of indices to split at to array_split.

  1. First build up the array of indices:

    axis = 0
    split_indices = [chunk_shape[axis]*(i+1) for i in range(math.floor(a.shape[axis] / chunk_shape[axis]))]
    

    This gives us a list of indices, each 50 from the last:

    print(split_indices)
    # [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]
    
  2. Then split:

    first_split = numpy.array_split(a, split_indices, axis=0)
    print(len(first_split))
    # 21
    print({i.shape for i in first_split})
    # {(9, 1009), (50, 1009)}
    # Distinct shapes, so we don't see all 21 separately
    print((first_split[0].shape, first_split[1].shape, '...', first_split[-2].shape, first_split[-1].shape))
    # ((50, 1009), (50, 1009), '...', (50, 1009), (9, 1009))
    
  3. And then again for the second axis:

    axis = 1
    split_indices = [chunk_shape[axis]*(i+1) for i in range(math.floor(a.shape[axis] / chunk_shape[axis]))]
    second_split = [numpy.array_split(a2, split_indices, axis=1) for a2 in first_split]
    print({i2.shape for i in second_split for i2 in i})
    # {(9, 50), (9, 9), (50, 9), (50, 50)}
    

The full function

Adapting the recursive function:

def split_to_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')

    if start_axis == len(a.shape):
        return a

    split_indices = [
        chunk_shape[start_axis]*(i+1)
        for i in range(math.floor(a.shape[start_axis] / chunk_shape[start_axis]))
    ]
    split = numpy.array_split(a, split_indices, axis=start_axis)
    return [split_to_shape(split_a, chunk_shape, start_axis + 1) for split_a in split]

And we call it exactly the same way:

full_split = split_to_shape(a, (50,50))
print({i2.shape for i in full_split for i2 in i})
# {(9, 50), (9, 9), (50, 9), (50, 50)}
# Distinct shapes
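As a sanity check (my addition, not part of the original answer), the nested list that split_to_shape returns for a 2-D array is exactly the layout numpy.block expects, so we can verify the split is lossless by reassembling it:

```python
import math
import numpy

def split_to_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')
    if start_axis == len(a.shape):
        return a
    split_indices = [
        chunk_shape[start_axis]*(i+1)
        for i in range(math.floor(a.shape[start_axis] / chunk_shape[start_axis]))
    ]
    split = numpy.array_split(a, split_indices, axis=start_axis)
    return [split_to_shape(s, chunk_shape, start_axis + 1) for s in split]

a = numpy.random.choice([1, 2, 3, 4], (1009, 1009))
full_split = split_to_shape(a, (50, 50))

# numpy.block stitches a nested list of arrays back into a single array,
# so a round trip should reproduce the original exactly.
reassembled = numpy.block(full_split)
assert numpy.array_equal(reassembled, a)
```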

Extra Notes

Performance

These functions seem to be quite fast. I was able to split a test array comparable to my example (nearly 14 billion elements) into 1000 by 1000 shaped pieces (nearly 14,000 new arrays) in under 0.05 seconds with either function:

import timeit

print('Building test array')
a = numpy.random.randint(4, size=(55000, 250000), dtype='uint8')
chunks = (1000, 1000)
numtests = 1000
print('Running {} tests'.format(numtests))
print('split_to_approx_shape: {} seconds'.format(timeit.timeit(lambda: split_to_approx_shape(a, chunks), number=numtests) / numtests))
print('split_to_shape: {} seconds'.format(timeit.timeit(lambda: split_to_shape(a, chunks), number=numtests) / numtests))

Output:

Building test array
Running 1000 tests
split_to_approx_shape: 0.035109398348040485 seconds
split_to_shape: 0.03113800323300747 seconds

I did not test speed with higher dimension arrays.

Shapes smaller than the max

These functions both work properly if the size of any dimension is less than the specified maximum. This requires no special logic.
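For example (a quick illustration, not from the original answer), a 30 by 70 array with a 50 by 50 limit is left whole along the first axis and split only along the second:

```python
import math
import numpy

def split_to_shape(a, chunk_shape, start_axis=0):
    if len(chunk_shape) != len(a.shape):
        raise ValueError('chunk length does not match array number of axes')
    if start_axis == len(a.shape):
        return a
    split_indices = [
        chunk_shape[start_axis]*(i+1)
        for i in range(math.floor(a.shape[start_axis] / chunk_shape[start_axis]))
    ]
    # With an empty index list (dimension smaller than the chunk size),
    # array_split returns the array unsplit along that axis.
    split = numpy.array_split(a, split_indices, axis=start_axis)
    return [split_to_shape(s, chunk_shape, start_axis + 1) for s in split]

a = numpy.zeros((30, 70))
full_split = split_to_shape(a, (50, 50))
print({c.shape for row in full_split for c in row})
# {(30, 50), (30, 20)}
```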

Answered Oct 15 '22 by jpmc26