What is an efficient way to initialize and access elements of a large array in Python?
I want to create an array in Python with 100 million entries, unsigned 4-byte integers, initialized to zero. I want fast array access, preferably with contiguous memory.
Strangely, NumPy arrays seem to be performing very slow. Are there alternatives I can try?
There is the array.array module, but I don't see a method to efficiently allocate a block of 100 million entries.
Responses to comments:
To give you a more direct answer to your question about performance: Arrays are more efficient than lists for some uses. If you need to allocate an array that you KNOW will not change, then arrays can be faster and use less memory.
I have done some profiling, and the results are completely counterintuitive. For simple array access operations, numpy and array.array are 10x slower than native Python arrays.
Note that for array access, I am doing operations of the form:
a[i] += 1
Profiles:
[0] * 20000000
numpy.zeros(shape=(20000000,), dtype=numpy.int32)
array.array('L', [0] * 20000000)
array.array('L', (0 for i in range(20000000)))
Just a reminder how Python's integers work: if you allocate a list by saying
a = [0] * K
you need the memory for the list (sizeof(PyListObject) + K * sizeof(PyObject*)
) and the memory for the single integer object 0
. As long as the numbers in the list stay below the magic number V
that Python uses for caching, you are fine because those are shared, i.e. any name that points to a number n < V
points to the exact same object. You can find this value by using the following snippet:
>>> i = 0 >>> j = 0 >>> while i is j: ... i += 1 ... j += 1 >>> i # on my system! 257
This means that as soon as the counts go above this number, the memory you need is sizeof(PyListObject) + K * sizeof(PyObject*) + d * sizeof(PyIntObject)
, where d < K
is the number of integers above V (== 256)
. On a 64 bit system, sizeof(PyIntObject) == 24
and sizeof(PyObject*) == 8
, i.e. the worst case memory consumption is 3,200,000,000 bytes.
With numpy.ndarray
or array.array
, memory consumption is constant after initialization, but you pay for the wrapper objects that are created transparently, as Thomas Wouters said. Probably, you should think about converting the update code (which accesses and increases the positions in the array) to C code, either by using Cython or scipy.weave
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With