
numpy array size vs. speed of concatenation

I am concatenating data to a numpy array like this:

xdata_test = np.concatenate((xdata_test,additional_X))

This is done a thousand times. The arrays have dtype float32, and their sizes are shown below:

xdata_test.shape   :  (x1,40,24,24)        (x1 : [500~10500])   
additional_X.shape :  (x2,40,24,24)        (x2 : [0 ~ 500])

The problem is that when x1 is larger than ~2000-3000, the concatenation takes a lot longer.

The graph below plots the concatenation time versus the size of the x2 dimension:

[Plot: concatenation time vs. x2]

Is this a memory issue or a basic characteristic of numpy?

asked Jan 12 '16 by MJ.Shin

2 Answers

As far as I understand numpy, none of the stack and concatenate functions are especially efficient, and for good reason: numpy keeps array memory contiguous for efficiency (see this link about contiguous arrays in numpy).

That means every concatenate operation has to copy the whole data every time. When I need to concatenate a bunch of elements together, I tend to do this:

l = []
for additional_X in ...:
    l.append(additional_X)
xdata_test = np.concatenate(l)

That way, the costly operation of copying all the data is only done once.

NB: I would be interested in the speed improvement this gives you.
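To see the difference, here is a minimal timing sketch of the two strategies. The chunk shapes are assumptions scaled down from the question's (x, 40, 24, 24) arrays just to keep the demo quick; only the relative timings matter.

```python
import time
import numpy as np

# Scaled-down stand-ins for the question's chunks (sizes are illustrative).
chunks = [np.ones((50, 8, 24, 24), dtype=np.float32) for _ in range(40)]

# Growing the array with repeated np.concatenate: every call re-copies
# everything accumulated so far, so total copying grows quadratically.
t0 = time.perf_counter()
acc = np.empty((0, 8, 24, 24), dtype=np.float32)
for c in chunks:
    acc = np.concatenate((acc, c))
t_repeated = time.perf_counter() - t0

# Collecting chunks in a list and concatenating once: each chunk is
# copied exactly one time into the final array.
t0 = time.perf_counter()
once = np.concatenate(chunks)
t_once = time.perf_counter() - t0

# Both strategies produce the same result.
assert np.array_equal(acc, once)
print(f"repeated: {t_repeated:.4f}s  single: {t_once:.4f}s")
```

With the real sizes from the question (x1 up to ~10500 rows of 40x24x24 float32), the gap between the two strategies is far larger than in this toy run.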

answered Oct 07 '22 by Benoit Seguin


If you have all the arrays you want to concatenate available in advance, I would suggest creating a new array with the total shape and filling it with the small arrays rather than concatenating, as every concatenation operation needs to copy the whole data to a new contiguous block of memory.

  • First, calculate the total size of the first axis:

    max_x = 0
    for arr in list_of_arrays:
        max_x += arr.shape[0]
    
  • Second, create the end container:

    final_data = np.empty((max_x,) + xdata_test.shape[1:], dtype=xdata_test.dtype)
    

which is equivalent to (max_x, 40, 24, 24) for the shapes in the question, but with the trailing dimensions and dtype taken from xdata_test rather than hard-coded.

  • Last, fill the numpy array:

    curr_x = 0
    for arr in list_of_arrays:
        final_data[curr_x:curr_x+arr.shape[0]] = arr
        curr_x += arr.shape[0]
    

The loop above copies each array into its pre-assigned rows of the larger array.

By doing this, each of the N arrays is copied exactly once, directly to its final destination, rather than creating temporary arrays for each concatenation.
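The three steps above can be rolled into one small helper. This is a self-contained sketch of the same preallocate-and-fill approach; `concat_preallocated` is a hypothetical name, and the chunk shapes in the usage lines are illustrative.

```python
import numpy as np

def concat_preallocated(list_of_arrays):
    """Concatenate along axis 0 by filling one preallocated array."""
    first = list_of_arrays[0]
    # Step 1: total length along the first axis.
    total = sum(arr.shape[0] for arr in list_of_arrays)
    # Step 2: the end container, with trailing dims and dtype from the data.
    out = np.empty((total,) + first.shape[1:], dtype=first.dtype)
    # Step 3: copy each array once, into its pre-assigned rows.
    start = 0
    for arr in list_of_arrays:
        out[start:start + arr.shape[0]] = arr
        start += arr.shape[0]
    return out

# Usage with chunk shapes mimicking the question (sizes are illustrative).
chunks = [np.random.rand(3, 40, 24, 24).astype(np.float32) for _ in range(4)]
result = concat_preallocated(chunks)
print(result.shape)  # (12, 40, 24, 24)
assert np.array_equal(result, np.concatenate(chunks))
```

The result is identical to np.concatenate(chunks); the only difference is that no intermediate arrays are created along the way.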

answered Oct 07 '22 by Imanol Luengo