I'm trying to concatenate several hundred arrays totaling almost 25 GB of data. I am testing on a 56 GB machine, but I get a memory error. I reckon the way I do my processing is inefficient and is eating lots of memory. This is my code:
import os
import numpy

BigArray = numpy.zeros((1, 200))
for dirname, dirnames, filenames in os.walk('/home/extra/AllData'):
    filenames.sort()
    for filename in filenames:
        newArray = numpy.load(os.path.join(dirname, filename))
        # grows a brand-new copy of everything on every iteration
        BigArray = numpy.concatenate((BigArray, newArray))
Any ideas, thoughts, or solutions?
Thanks
Your process is horribly inefficient. When handling such huge amounts of data, you really need to know your tools.
For your problem, np.concatenate is forbidden: it needs at least twice the memory of the inputs, and it copies every bit of data, so it is slow, too.
Use numpy.memmap to load the arrays. That will use only a few bytes of memory while still being pretty efficient.
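For example, if your files happen to be .npy arrays (an assumption; the file name below is made up), np.load with mmap_mode='r' returns a memory-mapped array instead of reading everything into RAM:

import numpy as np

# Minimal sketch: mmap_mode='r' memory-maps the .npy file, so only the
# pages you actually touch get read from disk.
a = np.load('/home/extra/AllData/part_000.npy', mmap_mode='r')
print(a.shape, a.dtype)  # shape and dtype are available without loading the data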
Join them using np.vstack. Call it only once (i.e. do not do bigArray = vstack((bigArray, newArray)) inside a loop!). Load all the arrays into a list allArrays and then call bigArray = vstack(allArrays), as in the sketch below.
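A minimal sketch of that, assuming the same directory as in the question and .npy files (the .npy filter is my assumption):

import os
import numpy as np

# Collect memory-mapped arrays in a list, then stack them exactly once.
allArrays = []
for dirname, dirnames, filenames in os.walk('/home/extra/AllData'):
    for filename in sorted(filenames):
        if filename.endswith('.npy'):
            allArrays.append(np.load(os.path.join(dirname, filename), mmap_mode='r'))

bigArray = np.vstack(allArrays)  # one copy into the final ~25 GB array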
If that is still too slow, you need to know the size of the final array in advance: create an array of that size once, then load the data into the existing array instead of creating a new one every time.
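A sketch of the preallocation variant, again assuming each file is an (n_i, 200) .npy array:

import os
import numpy as np

# Gather the file list, compute the total number of rows from the memory-mapped
# headers, then copy each file straight into its slice of one preallocated array.
files = []
for dirname, dirnames, filenames in os.walk('/home/extra/AllData'):
    files += [os.path.join(dirname, f) for f in sorted(filenames) if f.endswith('.npy')]

parts = [np.load(f, mmap_mode='r') for f in files]
total_rows = sum(p.shape[0] for p in parts)

bigArray = np.empty((total_rows, 200), dtype=parts[0].dtype)
row = 0
for p in parts:
    bigArray[row:row + p.shape[0]] = p  # fills the existing array, no intermediate copies
    row += p.shape[0]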
Depending on how often the files on disk change, it might be much more efficient to concatenate them with the OS tools to create one huge file and then load that (or use numpy.memmap)
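A sketch of that variant, assuming the files are headerless raw float64 dumps with 200 columns each (the file layout, extension, and combined path are all assumptions):

# Run once in the shell to glue the raw files together:
#   cat /home/extra/AllData/*.bin > /home/extra/alldata.bin
import numpy as np

big = np.memmap('/home/extra/alldata.bin', dtype=np.float64, mode='r')
big = big.reshape(-1, 200)  # a view of the mapped file as rows of 200 values, nothing is copied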