I have to iteratively read data files and store the data into (numpy) arrays. I chose to store the data in a dictionary of "data fields": {'field1': array1, 'field2': array2, ...}.
Using lists (or collections.deque()) for "appending" new data arrays, the code is efficient. But when I concatenate the arrays stored in the lists, the memory grows and I did not manage to free it again. Example:
import numpy as np

filename = 'test'  # data file containing a matrix of shape (98, 56)
nFields = 56

# Initialize the data dictionary and the list of field names.
# Each dictionary entry holds a list of arrays.
dataDict = {}
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = []

# Read the data file N times (it represents reading N files).
# The file contains 56 fields of arbitrary length in this example.
# Append the data fields to the lists in the data dictionary each time.
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i, field in enumerate(field_names):
        dataDict[field].append(xy[:, i])

# Concatenate the list members (arrays) into one numpy array per field.
for key, value in dataDict.iteritems():
    dataDict[key] = np.concatenate(value, axis=0)
Computing time: 63.4 s
Memory usage (top): 13862 gime_se 20 0 1042m 934m 4148 S 0 5.8 1:00.44 python
Concatenating the numpy arrays directly each time a file is read is inefficient, but memory remains under control. Example:
import numpy as np

filename = 'test'  # data file containing a matrix of shape (98, 56)
nFields = 56

# Initialize the data dictionary: each entry starts as an empty numpy array.
dataDict = {}
field_names = []
for i in xrange(nFields):
    field_names.append(repr(i))
    dataDict[repr(i)] = np.array([])

# Read the data file N times (it represents reading N files).
# Concatenate the data fields onto the numpy arrays in the data dictionary.
N = 10000
for j in xrange(N):
    xy = np.loadtxt(filename)
    for i, field in enumerate(field_names):
        dataDict[field] = np.concatenate((dataDict[field], xy[:, i]))
Computing time: 1377.8 s
Memory usage (top): 14850 gime_se 20 0 650m 542m 4144 S 0 3.4 22:31.21 python
Is there any way to get the performance of Case 1 while keeping the memory under control as in Case 2?
It seems that in Case 1 the memory grows when the list members are concatenated (np.concatenate(value, axis=0)). Any better ideas for doing this?
Here's what is going on, based on what I've observed. There isn't really a memory leak. Instead, Python's memory management code (possibly in connection with the memory management of whatever OS you are in) is deciding to keep the space used by the original dictionary (the one without the concatenated arrays) in the program. However, it is free to be reused. I verified this by putting the Case 1 code into a function that returns dataDict, then calling that function twice and assigning the results to two different variables.
When I do this, I find that the amount of memory used only increases from ~900 MB to ~1.3 GB. Without the extra dictionary memory, the Numpy data itself should take up about 427 MB by my calculations, so this adds up. The second initial, unconcatenated dictionary that our function created simply reused the already allocated memory.
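In code, that check looks roughly like this (the function name and defaults are only illustrative):

import numpy as np

def read_and_concatenate(filename='test', nFields=56, N=10000):
    # Case 1 wrapped in a function so its temporary lists can be
    # reclaimed and reused once the function returns.
    dataDict = dict((repr(i), []) for i in xrange(nFields))
    for j in xrange(N):
        xy = np.loadtxt(filename)
        for i in xrange(nFields):
            dataDict[repr(i)].append(xy[:, i])
    for key, value in dataDict.iteritems():
        dataDict[key] = np.concatenate(value, axis=0)
    return dataDict

# Call it twice and watch the process size in top: the second call
# largely reuses memory that Python kept around after the first one.
d1 = read_and_concatenate()
d2 = read_and_concatenate()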
If you really can't use more than ~600 MB of memory, then I would recommend doing something with your Numpy arrays similar to what is done internally with Python lists: allocate an array with a certain number of columns, and when you've used those up, create an enlarged array with more columns and copy the data over (see the sketch below). This will reduce the number of concatenations, meaning it will be faster (though still not as fast as lists), while keeping the memory usage down. Of course, it is also more of a pain to implement.
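A rough sketch of that growing-array idea (the class name, doubling policy, and API here are just illustrative):

import numpy as np

class GrowableArray(object):
    # Illustrative helper: a 1-D array that grows by doubling its capacity,
    # similar to how Python lists over-allocate internally.
    def __init__(self, capacity=1024, dtype=float):
        self._data = np.empty(capacity, dtype=dtype)
        self._size = 0

    def append(self, values):
        values = np.asarray(values)
        needed = self._size + values.size
        if needed > self._data.size:
            # Double the capacity until the new values fit,
            # then copy the existing data over once.
            new_capacity = self._data.size
            while new_capacity < needed:
                new_capacity *= 2
            new_data = np.empty(new_capacity, dtype=self._data.dtype)
            new_data[:self._size] = self._data[:self._size]
            self._data = new_data
        self._data[self._size:needed] = values
        self._size = needed

    def view(self):
        # Return only the filled portion, without copying.
        return self._data[:self._size]

In the Case 1 loop you would then call dataDict[field].append(xy[:, i]) on such an object instead of on a plain list, and read back dataDict[field].view() at the end. That keeps roughly one contiguous copy of the data plus at most one doubling's worth of slack, instead of holding both the per-file arrays and the final concatenated arrays at the same time.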