What is an efficient way to initialize and access elements of a large array in Python?

I want to create an array in Python with 100 million entries, unsigned 4-byte integers, initialized to zero. I want fast array access, preferably with contiguous memory.

Strangely, NumPy arrays seem to perform very slowly. Are there alternatives I can try?

There is the array.array module, but I don't see a method to efficiently allocate a block of 100 million entries.

Responses to comments:

- I cannot use a sparse array. It will be too slow for this algorithm because the array becomes dense very quickly.
- I know Python is interpreted, but surely there is a way to do fast array operations?
- I did some profiling, and I get about 160K array accesses (looking up or updating an element by index) per second with NumPy. This seems very slow.

Efficient Python array with 100 million zeros?

2 Answers

I have done some profiling, and the results are completely counterintuitive. For simple element access, numpy and array.array are more than 10x slower than native Python lists.

Note that for array access, I am doing operations of the form:

a[i] += 1

Profiles:

- [0] * 20000000
  - Access: 2.3M/sec
  - Initialization: 0.8s
- numpy.zeros(shape=(20000000,), dtype=numpy.int32)
  - Access: 160K/sec
  - Initialization: 0.2s
- array.array('L', [0] * 20000000)
  - Access: 175K/sec
  - Initialization: 2.0s
- array.array('L', (0 for i in range(20000000)))
  - Access: 175K/sec, presumably, based upon the profile for the other array.array
  - Initialization: 6.7s
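
Both array.array initializations profiled above build the array from a 20-million-element list or generator. As a minimal sketch of an alternative, a one-element array can be repeated with `*`, which avoids materializing the intermediate sequence. The helper name `zeros_uint32` is hypothetical, and the typecode check is there because `'L'` (unsigned long) is 8 bytes on most 64-bit Unix builds, while `'I'` is usually the 4-byte type. Whether this is faster on a given machine has not been measured here.

```python
from array import array


def zeros_uint32(n):
    """Return an array of n zeros stored as unsigned 4-byte integers.

    Hypothetical helper: picks whichever unsigned typecode is 4 bytes wide
    on this platform, then repeats a one-element array n times.
    """
    for typecode in ("I", "L"):
        if array(typecode).itemsize == 4:
            # Sequence repetition allocates the block without a temporary list.
            return array(typecode, [0]) * n
    raise RuntimeError("no 4-byte unsigned typecode on this platform")


a = zeros_uint32(1000)  # use 100_000_000 for the size in the question
a[10] += 1              # the same update form used in the profiles
```

The same `a[i] += 1` access pattern works on the result, so it can be dropped into the profiling loop unchanged.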

answered Sep 21 '22 19:09

Joseph Turian

Just a reminder how Python's integers work: if you allocate a list by saying

a = [0] * K

you need the memory for the list (sizeof(PyListObject) + K * sizeof(PyObject*)) and the memory for the single integer object 0. As long as the numbers in the list stay below the threshold V that Python uses for caching small integers, you are fine because those objects are shared, i.e. any name that points to a number n < V points to the exact same object. You can find this value with the following snippet:

>>> i = 0
>>> j = 0
>>> while i is j:
...    i += 1
...    j += 1
...
>>> i # on my system!
257

This means that as soon as the counts leave the cached range, the memory you need is sizeof(PyListObject) + K * sizeof(PyObject*) + d * sizeof(PyIntObject), where d <= K is the number of entries holding an uncached integer (values above 256 on the system shown). On a 64-bit system, sizeof(PyIntObject) == 24 and sizeof(PyObject*) == 8, so for K = 100 million the worst-case memory consumption (d == K) is 3,200,000,000 bytes.
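
The arithmetic behind that figure can be checked with a short sketch. The sizes are the 64-bit values quoted above, the constant names and the `list_memory_bytes` helper are made up for illustration, and the fixed list header (sizeof(PyListObject)) is left out because it is negligible at this scale.

```python
K = 100_000_000          # number of entries, from the question
SIZEOF_POINTER = 8       # sizeof(PyObject*) on a 64-bit system
SIZEOF_INT_OBJECT = 24   # sizeof(PyIntObject) on a 64-bit system


def list_memory_bytes(k, d):
    """Approximate bytes for a list of k entries, d of them uncached ints."""
    return k * SIZEOF_POINTER + d * SIZEOF_INT_OBJECT


best_case = list_memory_bytes(K, 0)   # every entry shares a cached integer
worst_case = list_memory_bytes(K, K)  # every entry owns its own integer
packed = K * 4                        # contiguous unsigned 4-byte integers

print(best_case, worst_case, packed)
```

Under these assumptions the list ranges from 0.8 GB to 3.2 GB, compared with a constant 0.4 GB for a packed 4-byte array.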

With numpy.ndarray or array.array, memory consumption is constant after initialization, but you pay for the wrapper objects that are created transparently on each element access, as Thomas Wouters said. You should probably consider converting the update code (which accesses and increments positions in the array) to C, using either Cython or scipy.weave.
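
If the indices to increment can be collected into batches (an assumption, since the question does not say how they are produced), the per-element wrapper cost can also be avoided from within NumPy, without writing C. The following is a rough sketch of that swapped-in technique, not the approach the answer recommends; the array size and index values are made up.

```python
import numpy as np

n = 1000                                  # use 100_000_000 for the real case
counts = np.zeros(n, dtype=np.uint32)     # contiguous unsigned 4-byte zeros

# Hypothetical batch of positions to increment; repeats are allowed.
indices = np.array([3, 7, 3, 999, 3])

# np.add.at applies the increment once per occurrence, so repeated indices
# accumulate, unlike counts[indices] += 1, which counts each index once.
np.add.at(counts, indices, 1)
```

When the counts start from zero, `np.bincount(indices, minlength=n)` is another way to get the same totals in one call.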

answered Sep 19 '22 19:09

Torsten Marek

Tags: performance, python, arrays
