Are numpy arrays and Python lists optimized to grow dynamically?

Over time I have done many things that required using a list's .append() method, as well as the numpy.append() function for numpy arrays. I noticed that both become really slow when the arrays get big.

I need an array that grows dynamically to sizes of about 1 million elements. I could implement this myself, the way std::vector is implemented in C++, by keeping a reserve capacity that is not accessible from the outside. But do I have to reinvent the wheel? I imagine something like this is implemented somewhere already. So my question is: does such a thing already exist in Python?
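
For illustration, roughly the kind of thing I could write myself (a sketch only; GrowableArray is a made-up name, not an existing library):

    import numpy as np

    class GrowableArray:
        """std::vector-style growable array over a fixed numpy buffer."""

        def __init__(self, capacity=16, dtype=float):
            self._buf = np.empty(capacity, dtype=dtype)  # reserved space
            self._size = 0                               # elements actually used

        def append(self, value):
            if self._size == len(self._buf):
                # Buffer full: double the capacity and copy, so that
                # append stays amortized O(1).
                new_buf = np.empty(2 * len(self._buf), dtype=self._buf.dtype)
                new_buf[:self._size] = self._buf
                self._buf = new_buf
            self._buf[self._size] = value
            self._size += 1

        def view(self):
            # View of only the filled part of the buffer.
            return self._buf[:self._size]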

What I mean: is there an array type in Python that can grow dynamically with O(1) (constant-time) appends most of the time?

asked Oct 29 '22 by The Quantum Physicist

1 Answer

The memory layout of numpy arrays is well described in the numpy docs, and has been discussed here a lot. List memory layout has also been discussed, though usually just in contrast to numpy.

A numpy array has a fixed-size data buffer. 'Growing' it requires creating a new array and copying the data over. np.concatenate does that in compiled code. np.append, as well as all the stack functions, uses concatenate.
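
A quick way to see the copy (a small demonstration; np.append never modifies its input):

    import numpy as np

    a = np.arange(5)
    b = np.append(a, 6)   # allocates a new array and copies a into it

    print(a)              # [0 1 2 3 4]  -- a is unchanged
    print(b)              # [0 1 2 3 4 6]
    # The two arrays own different data buffers:
    print(a.__array_interface__['data'][0] ==
          b.__array_interface__['data'][0])   # False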

A list has, as I understand it, a contiguous data buffer that contains pointers to objects elsewhere in memory. Python maintains some free space in that buffer, so additions with list.append are relatively fast and easy. But when the free space fills up, it has to create a new buffer and copy the pointers. I can see where that could get expensive with large lists.
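
You can watch the over-allocation with sys.getsizeof, which reports the size of the pointer buffer (exact numbers vary by Python version and platform):

    import sys

    lst = []
    last = sys.getsizeof(lst)
    for i in range(64):
        lst.append(None)
        size = sys.getsizeof(lst)
        if size != last:
            # The size only jumps when the free space runs out and a
            # larger pointer buffer has to be allocated.
            print(len(lst), size)
            last = size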

So a list stores a pointer for each element, plus the element itself (e.g. a float object) somewhere else in memory. In contrast, a numpy array of floats stores the floats themselves as contiguous bytes in its buffer. (Object-dtype arrays are more like lists.)
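
A rough footprint comparison (numbers assume a 64-bit CPython build and will vary):

    import sys
    import numpy as np

    n = 1000
    lst = [float(i) for i in range(n)]
    arr = np.array(lst)

    ptr_buffer = sys.getsizeof(lst)                  # just the pointer array
    float_objs = sum(sys.getsizeof(x) for x in lst)  # the float objects themselves
    print(ptr_buffer + float_objs)   # roughly 8 + 24 bytes per element
    print(arr.nbytes)                # exactly 8 bytes per element, contiguous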

The recommended way to create an array iteratively is to build a list with append and create the array once at the end. Repeated np.append or np.concatenate calls are relatively expensive.
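
For example (a sketch; time both with timeit to see the difference on your machine):

    import numpy as np

    # Recommended: accumulate in a list, convert once at the end.
    chunks = []
    for i in range(10_000):
        chunks.append(i * 0.5)
    result = np.array(chunks)            # one allocation, one copy

    # Anti-pattern: grow the array itself.
    slow = np.empty(0)
    for i in range(10_000):
        slow = np.append(slow, i * 0.5)  # reallocates and copies every time; O(n^2) overall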

deque was mentioned. CPython stores its data as a doubly linked list of fixed-size blocks, which is why it can add elements at the start just as easily as at the end, while random access is slower than for a list: finding the nth element means walking the chain of blocks to reach it. So there's a trade-off between growth ease and access speed.

Adding elements to the start of a list requires shifting every existing pointer up one slot (and possibly reallocating the buffer) to make room at the front. So adding and removing elements at the start of a regular list is much more expensive than doing it at the end, as the comparison below shows.
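
A quick comparison of front insertion (absolute times depend on the machine; the growth rates are what matter):

    from collections import deque
    from timeit import timeit

    def list_front(n=50_000):
        lst = []
        for i in range(n):
            lst.insert(0, i)    # shifts every existing pointer: O(n) each

    def deque_front(n=50_000):
        dq = deque()
        for i in range(n):
            dq.appendleft(i)    # O(1) at either end

    print(timeit(list_front, number=1))   # quadratic in n
    print(timeit(deque_front, number=1))  # linear in n, much faster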

Recommending software is outside of the core SO purpose. Others may make suggestions, but don't be surprised if this gets closed.

There are file formats like HDF5 that are designed for large data sets. They accommodate growth with features like 'chunking'. And there are all kinds of database packages.
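
For instance, with h5py (one common HDF5 binding; this sketch assumes it, since only the format in general is mentioned above), a chunked dataset can grow in place on disk:

    import h5py
    import numpy as np

    with h5py.File('growing.h5', 'w') as f:
        # maxshape=(None,) makes the axis unlimited; chunking is what
        # lets HDF5 extend the dataset without rewriting old data.
        dset = f.create_dataset('data', shape=(0,), maxshape=(None,),
                                chunks=(1024,), dtype='f8')
        for _ in range(10):
            block = np.random.rand(1000)
            old = dset.shape[0]
            dset.resize((old + block.size,))   # grow in place
            dset[old:] = block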

answered Nov 15 '22 by hpaulj