Python lists/dictionaries vs. numpy arrays: performance vs. memory control

I have to iteratively read data files and store the data into (numpy) arrays. I chose to store the data into a dictionary of "data fields": {'field1': array1,'field2': array2,...}.

Case 1 (lists):

Using lists (or collections.deque()) to append the new data arrays, the code is efficient. But when I concatenate the arrays stored in the lists, the memory grows and I could not manage to free it again. Example:

import numpy as np

filename = 'test'
# data file with a matrix of shape (98, 56)
nFields = 56
# Initialize data dictionary and list of fields
dataDict = {}

# data dictionary: each entry contains a list
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = []

# Read a data file N times (it represents N files reading)
# file contains 56 fields of arbitrary length in the example
# Append each time the data fields to the lists (in the data dictionary)
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i,field in enumerate(field_names):
        dataDict[field].append(xy[:,i])

# Concatenate list members (arrays) into a single numpy array per field
for key,value in dataDict.iteritems():
    dataDict[key] = np.concatenate(value,axis=0)

Computing time: 63.4 s
Memory usage (top): 13862 gime_se 20 0 1042m 934m 4148 S 0 5.8 1:00.44 python

Case 2 (numpy arrays):

Concatenating the numpy arrays directly each time a file is read is inefficient, but the memory stays under control. Example:

import numpy as np

filename = 'test'  # same data file as in Case 1
nFields = 56
dataDict = {}
# data dictionary: each entry contains an empty numpy array
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = np.array([])

# Read a data file N times (it represents N files reading)
# Concatenate data fields to numpy arrays (in the data dictionary)
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i,field in enumerate(field_names):
        dataDict[field] = np.concatenate((dataDict[field],xy[:,i])) 

Computing time: 1377.8 s
Memory usage (top): 14850 gime_se 20 0 650m 542m 4144 S 0 3.4 22:31.21 python

Question(s):

  • Is there any way of having the performance of Case 1 but keeping the memory under control as in Case 2?

  • It seems that in Case 1 the memory grows when concatenating the list members (np.concatenate(value, axis=0)). Are there better ways of doing this?

asked Feb 08 '11 by chan gimeno



1 Answer

Here's what is going on, based on what I've observed. There isn't really a memory leak. Instead, Python's memory management (possibly in connection with the memory management of whatever OS you are on) is deciding to keep the space used by the original dictionary (the one holding the lists, before concatenation) allocated to the process. However, that space is free to be reused. I verified this by doing the following:

  1. Making the code you gave as an answer into a function that returned dataDict.
  2. Calling the function twice and assigning the results to two different variables.

When I do this, I find that the amount of memory used increased only from ~900 MB to ~1.3 GB. Without the extra dictionary memory, the Numpy data itself should take up about 427 MB by my calculations, so this adds up. The second initial, unconcatenated dictionary that our function created simply reused the already allocated memory.
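
For reference, a minimal sketch of that check, assuming Case 1's code is wrapped in a hypothetical read_all() function (the function name and the way the parameters are passed are just illustrative):

import numpy as np

def read_all(filename, nFields=56, N=10000):
    # Same logic as Case 1, wrapped in a function that returns dataDict
    dataDict = {}
    field_names = []
    for i in xrange(nFields):
        field_names.append(repr(i))
        dataDict[repr(i)] = []
    for j in xrange(N):
        xy = np.loadtxt(filename)
        for i, field in enumerate(field_names):
            dataDict[field].append(xy[:, i])
    for key, value in dataDict.iteritems():
        dataDict[key] = np.concatenate(value, axis=0)
    return dataDict

# Call it twice and watch resident memory (e.g. in top): the second call
# largely reuses the space left over from the first call's intermediate lists.
d1 = read_all('test')
d2 = read_all('test')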

If you really can't use more than ~600 MB of memory, then I would recommend handling your Numpy arrays somewhat like Python lists are handled internally: allocate an array with a certain amount of spare capacity, and when that is used up, create an enlarged array and copy the data over. This reduces the number of concatenations, meaning it will be faster (though still not as fast as lists), while keeping memory usage down. Of course, it is also more of a pain to implement.
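
A minimal sketch of that strategy (this is not code from the answer; the class name, doubling growth factor, and finalize() helper are illustrative assumptions):

import numpy as np

class GrowableArray(object):
    """Append-only 1-D buffer that grows geometrically,
    similar in spirit to how Python lists over-allocate."""
    def __init__(self, capacity=1024, dtype=float):
        self.data = np.empty(capacity, dtype=dtype)
        self.size = 0

    def append(self, values):
        n = len(values)
        while self.size + n > len(self.data):
            # Double the capacity and copy the existing data over
            new_data = np.empty(2 * len(self.data), dtype=self.data.dtype)
            new_data[:self.size] = self.data[:self.size]
            self.data = new_data
        self.data[self.size:self.size + n] = values
        self.size += n

    def finalize(self):
        # Return a trimmed copy of the used part of the buffer
        return self.data[:self.size].copy()

Usage in the reading loop would look roughly like this (hypothetical):

# dataDict[field] = GrowableArray()        # instead of an empty list
# dataDict[field].append(xy[:, i])         # inside the file-reading loop
# dataDict[key] = dataDict[key].finalize() # after all files are read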

answered by Justin Peel