Python lists/dictionaries vs. numpy arrays: performance vs. memory control

I have to iteratively read data files and store the data into (numpy) arrays. I chose to store the data into a dictionary of "data fields": {'field1': array1,'field2': array2,...}.

Case 1 (lists):

Using lists (or collections.deque()) to append the new data arrays, the code is efficient. But when I concatenate the arrays stored in the lists, the memory grows and I could not manage to free it again. Example:

import numpy as np

filename = 'test'
# data file with a matrix of shape (98, 56)
nFields = 56
# Initialize data dictionary and list of fields
dataDict = {}

# data dictionary: each entry contains a list
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = []

# Read a data file N times (it represents N files reading)
# file contains 56 fields of arbitrary length in the example
# Append each time the data fields to the lists (in the data dictionary)
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i,field in enumerate(field_names):
        dataDict[field].append(xy[:,i])

# Concatenate list members (arrays) into a single numpy array per field
for key,value in dataDict.iteritems():
    dataDict[key] = np.concatenate(value,axis=0)

Computing time: 63.4 s
Memory usage (top): 13862 gime_se 20 0 1042m 934m 4148 S 0 5.8 1:00.44 python

Case 2 (numpy arrays):

Concatenating the numpy arrays directly each time a file is read is inefficient, but the memory stays under control. Example:

import numpy as np

filename = 'test'  # same data file as in Case 1
nFields = 56
dataDict = {}
# data dictionary: each entry contains an empty numpy array
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = np.array([])

# Read a data file N times (it represents N files reading)
# Concatenate data fields to numpy arrays (in the data dictionary)
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i,field in enumerate(field_names):
        dataDict[field] = np.concatenate((dataDict[field],xy[:,i])) 

Computing time: 1377.8 s
Memory usage (top): 14850 gime_se 20 0 650m 542m 4144 S 0 3.4 22:31.21 python

Question(s):

  • Is there any way of having the performance of Case 1 but keeping the memory under control as in Case 2?

  • It seems that in Case 1 the memory grows when concatenating the list members (np.concatenate(value, axis=0)). Are there better ways of doing this?

asked Feb 08 '11 by chan gimeno



1 Answer

Here's what is going on, based on what I've observed. There isn't really a memory leak. Instead, Python's memory management (possibly in connection with the memory management of whatever OS you are on) is deciding to keep the space used by the original dictionary (the one holding the lists, before concatenation) allocated to the process. However, that space is free to be reused. I verified this by doing the following:

  1. Making the code you gave as an answer into a function that returned dataDict.
  2. Calling the function twice and assigning the results to two different variables.

When I do this, I find that the amount of memory used increased only from ~900 MB to ~1.3 GB. Without the extra dictionary memory, the Numpy data itself should take up about 427 MB by my calculations, so this adds up. The second initial, unconcatenated dictionary that our function created simply reused the already allocated memory.
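
For reference, a minimal sketch of that check, assuming Case 1's code is wrapped in a hypothetical read_all() function (the function name and the way the parameters are passed are just illustrative):

import numpy as np

def read_all(filename, nFields=56, N=10000):
    # Same logic as Case 1, wrapped in a function that returns dataDict
    dataDict = {}
    field_names = []
    for i in xrange(nFields):
        field_names.append(repr(i))
        dataDict[repr(i)] = []
    for j in xrange(N):
        xy = np.loadtxt(filename)
        for i, field in enumerate(field_names):
            dataDict[field].append(xy[:, i])
    for key, value in dataDict.iteritems():
        dataDict[key] = np.concatenate(value, axis=0)
    return dataDict

# Call it twice and watch resident memory (e.g. in top): the second call
# largely reuses the space left over from the first call's intermediate lists.
d1 = read_all('test')
d2 = read_all('test')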

If you really can't use more than ~600 MB of memory, then I would recommend handling your Numpy arrays somewhat like Python lists are handled internally: allocate an array with a certain amount of spare capacity, and when that is used up, create an enlarged array and copy the data over. This reduces the number of concatenations, meaning it will be faster (though still not as fast as lists), while keeping memory usage down. Of course, it is also more of a pain to implement.
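
A minimal sketch of that strategy (this is not code from the answer; the class name, doubling growth factor, and finalize() helper are illustrative assumptions):

import numpy as np

class GrowableArray(object):
    """Append-only 1-D buffer that grows geometrically,
    similar in spirit to how Python lists over-allocate."""
    def __init__(self, capacity=1024, dtype=float):
        self.data = np.empty(capacity, dtype=dtype)
        self.size = 0

    def append(self, values):
        n = len(values)
        while self.size + n > len(self.data):
            # Double the capacity and copy the existing data over
            new_data = np.empty(2 * len(self.data), dtype=self.data.dtype)
            new_data[:self.size] = self.data[:self.size]
            self.data = new_data
        self.data[self.size:self.size + n] = values
        self.size += n

    def finalize(self):
        # Return a trimmed copy of the used part of the buffer
        return self.data[:self.size].copy()

Usage in the reading loop would look roughly like this (hypothetical):

# dataDict[field] = GrowableArray()        # instead of an empty list
# dataDict[field].append(xy[:, i])         # inside the file-reading loop
# dataDict[key] = dataDict[key].finalize() # after all files are read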

answered by Justin Peel