I would like to load as much data as is safe, so that both the current process and other processes keep working fine. I would prefer to use RAM only (no swap), but any suggestions are welcome. Excess data can be discarded. What is the proper way of doing this? If I just wait for a MemoryError, the system becomes inoperable (when using a list):
data_storage = []
for data in read_next_data():
    data_storage.append(data)
The data is finally loaded into a numpy array.
psutil has a virtual_memory function whose result contains, among other attributes, the amount of free memory:
>>> psutil.virtual_memory()
svmem(total=4170924032, available=1743937536, percent=58.2, used=2426986496, free=1743937536)
>>> psutil.virtual_memory().free
1743937536
That should be pretty accurate (but the function call is costly - slow - at least on Windows). Note that MemoryError doesn't take memory used by other processes into account; it's only raised when the array exceeds the total RAM (free or not).
You may have to guess the point at which you stop accumulating, because the amount of free memory can change (other processes also need additional memory from time to time), and the conversion to numpy.array might temporarily double your memory usage, because at that moment both the list and the array must fit into RAM.
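As a minimal sketch of that first idea (the function name `load_until_threshold`, the iterable argument, and the 500 MB safety margin are my assumptions, not part of the question):

```python
import psutil

def load_until_threshold(data_iter, min_free_bytes=500 * 1024**2):
    """Accumulate datasets until available RAM drops below a threshold.

    data_iter: an iterable of datasets, like read_next_data() in the
    question. min_free_bytes is an arbitrary safety margin to leave
    for other processes and for the later list-to-array conversion.
    """
    data_storage = []
    for data in data_iter:
        data_storage.append(data)
        # virtual_memory() is relatively slow, so in practice you
        # might check it only every N iterations.
        if psutil.virtual_memory().available < min_free_bytes:
            break  # excess data is simply discarded
    return data_storage
```

Because the conversion to an array can double memory usage, the margin should be at least as large as the accumulated data.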
However, you can also approach this in a different way:
Use the free memory reported by psutil.virtual_memory().free, the shape of the first dataset from read_next_data() and its dtype to calculate the shape of an array that fits easily into RAM. Say it should use a factor (e.g. 75%) of the available free memory: rows = freeMemory * factor / (firstDataShape * memoryPerElement). That gives you the number of datasets you can read in at once. Preallocate the array with arr = np.empty((rows, *firstShape), dtype=firstDtype), then fill it one dataset at a time with arr[i] = next(read_next_data). That way you don't keep the lists around and you avoid the doubled memory.
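A sketch of that preallocation approach, assuming the iterator yields numpy arrays of identical shape and dtype (the function name and the use of .available instead of .free are my choices):

```python
import numpy as np
import psutil

def preallocate_and_fill(data_iter, factor=0.75):
    """Preallocate one array sized to a fraction of free RAM, then fill it.

    data_iter is assumed to yield numpy arrays of identical shape/dtype.
    """
    first = next(data_iter)
    # rows = freeMemory * factor / bytes-per-dataset
    free = psutil.virtual_memory().available
    rows = int(free * factor // first.nbytes)

    arr = np.empty((rows, *first.shape), dtype=first.dtype)
    arr[0] = first
    filled = 1
    for i, data in enumerate(data_iter, start=1):
        if i >= rows:
            break  # array is full; excess data is discarded
        arr[i] = data
        filled = i + 1
    # Return only the rows that were actually filled
    return arr[:filled]
```

Since the array is written in place, peak memory usage stays at roughly the size of the preallocated array plus one dataset.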