Over time I have done many things that require using the list's .append() function, and also numpy.append() for numpy arrays. I noticed that both become really slow when the arrays get large.
I need an array that grows dynamically, to sizes of about a million elements. I could implement this myself, the way std::vector is implemented in C++, by keeping a reserved buffer length that is not accessible from the outside. But do I have to reinvent the wheel? I imagine it must be implemented somewhere already. So my question is: does such a thing already exist in Python?
What I mean: is there an array type in Python that can grow dynamically with amortized O(1) append, i.e. constant time most of the time?
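For illustration, a minimal sketch of what I mean, built on top of a NumPy buffer (the GrowableArray name and API are hypothetical):

```python
import numpy as np

class GrowableArray:
    """Toy std::vector-style array: keeps spare capacity, doubles when full."""

    def __init__(self, dtype=float, capacity=16):
        self._buf = np.empty(capacity, dtype=dtype)  # reserved storage
        self._n = 0                                  # elements actually in use

    def append(self, value):
        if self._n == len(self._buf):                # buffer full: double it
            new_buf = np.empty(2 * len(self._buf), dtype=self._buf.dtype)
            new_buf[:self._n] = self._buf
            self._buf = new_buf
        self._buf[self._n] = value
        self._n += 1

    @property
    def data(self):
        return self._buf[:self._n]                   # view of the used part
```

Because the capacity doubles, each element is copied only O(1) times on average, which gives the amortized O(1) append I am after.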
The memory of numpy arrays is well described in its docs, and has been discussed here a lot. List memory layout has also been discussed, though usually just in contrast to numpy.
A numpy array has a fixed-size data buffer. 'Growing' it requires creating a new array and copying data to it. np.concatenate does that in compiled code. np.append, as well as all the stack functions, uses concatenate.
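A quick demo makes it visible that np.append never grows the array in place, but always allocates a fresh one:

```python
import numpy as np

a = np.arange(5)
b = np.append(a, 99)   # internally a concatenate: builds a brand-new array
print(b is a)          # False: a new buffer was allocated
print(a)               # [0 1 2 3 4] -- the original is untouched
```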
A list has, as I understand it, a contiguous data buffer that contains pointers to objects elsewhere in memory. Python maintains some free space in that buffer, so additions with list.append are relatively fast and easy. But when the free space fills up, it has to create a new buffer and copy the pointers. I can see where that could get expensive with large lists.
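You can watch that over-allocation happen with sys.getsizeof; the exact sizes are a CPython implementation detail, so treat the numbers as illustrative:

```python
import sys

lst = []
last = sys.getsizeof(lst)
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:   # the size only jumps when the buffer is reallocated
        print(f"len={len(lst):3d}  bytes={size}")
        last = size
```

The size jumps at a handful of lengths rather than on every append, which is why list.append stays cheap on average.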
So a list has to store a pointer for each element, plus the element itself (e.g. a float) somewhere else in memory. In contrast, an array of floats stores the floats themselves as contiguous bytes in its buffer. (Object-dtype arrays are more like lists.)
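The difference in layout shows up directly in memory use. A rough comparison (figures assume 64-bit CPython, where a float object takes about 24 bytes):

```python
import sys
import numpy as np

n = 1_000_000
floats = [float(i) for i in range(n)]
arr = np.arange(n, dtype=np.float64)

ptrs = sys.getsizeof(floats)       # the list's pointer buffer only
objs = n * sys.getsizeof(1.0)      # the float objects it points to
print(f"list : ~{(ptrs + objs) / 1e6:.0f} MB (pointers + float objects)")
print(f"array: {arr.nbytes / 1e6:.0f} MB (contiguous float64 data)")
```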
The recommended way to create an array iteratively is to build a list with append, and create the array once at the end. Repeated np.append or np.concatenate calls are relatively expensive.
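In code, the recommended pattern versus the expensive one looks like this:

```python
import numpy as np

# Fast: append to a plain list, convert to an array once at the end.
tmp = []
for i in range(1_000_000):
    tmp.append(i * 0.5)
arr = np.array(tmp)

# Slow (shown for contrast, don't run it at this size): every np.append
# call copies the whole array built so far, giving quadratic total cost.
# arr = np.array([])
# for i in range(1_000_000):
#     arr = np.append(arr, i * 0.5)
```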
deque was mentioned. I don't know much about how it stores its data. The docs say it can add elements at the start just as easily as at the end, but random access is slower than for a list. That implies that it stores data in some sort of linked list, so that finding the nth element requires traversing the n-1 links before it. So there is a trade-off between ease of growth and access speed.
Adding elements to the start of a list requires shifting every existing pointer to make room at the front (and possibly reallocating the buffer). So adding and removing elements at the start of a regular list is much more expensive than doing so at the end; a rough timing comparison follows below.
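A quick, rough comparison of prepending with a deque versus a list (timings will vary by machine):

```python
from timeit import timeit

n = 10_000
t_deque = timeit("d.appendleft(0)",
                 setup="from collections import deque; d = deque()",
                 number=n)
t_list = timeit("l.insert(0, 0)", setup="l = []", number=n)
print(f"deque.appendleft: {t_deque:.4f}s  list.insert(0, ...): {t_list:.4f}s")
```

The gap widens as the containers grow, since each list.insert(0, ...) has to shift every pointer already in the list.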
Recommending software is outside of the core SO purpose. Others may make suggestions, but don't be surprised if this gets closed.
There are file formats like HDF5 that are designed for large data sets. They accommodate growth with features like 'chunking'. And there are all kinds of database packages.
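For example, assuming the h5py package, an HDF5 dataset created with maxshape=(None,) can be grown in place, chunk by chunk:

```python
import h5py
import numpy as np

with h5py.File("grow.h5", "w") as f:
    # chunks + maxshape=(None,) make the dataset resizable along axis 0
    dset = f.create_dataset("data", shape=(0,), maxshape=(None,),
                            dtype="f8", chunks=(4096,))
    for block in (np.random.rand(1000) for _ in range(5)):
        n = dset.shape[0]
        dset.resize((n + len(block),))   # extend, then write the new block
        dset[n:] = block
```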