
efficient array concatenation

I'm trying to concatenate several hundred arrays totaling almost 25 GB of data. I am testing on a 56 GB machine, but I receive a memory error. I reckon the way I do my processing is inefficient and is eating up lots of memory. This is my code:

    import os
    import numpy

    # note: after the loop, dirname/filenames refer to the last directory os.walk visited
    for dirname, dirnames, filenames in os.walk('/home/extra/AllData'):
        filenames.sort()

    BigArray = numpy.zeros((1, 200))
    for filename in filenames:
        newArray = numpy.load(os.path.join(dirname, filename))
        BigArray = numpy.concatenate((BigArray, newArray))

Any ideas, thoughts or solutions?

Thanks

Adham Ghazali asked Oct 21 '22 at 21:10


1 Answer

Your process is horribly inefficient. When handling such huge amounts of data, you really need to know your tools.

For your problem, np.concatenate is forbidden - it needs at least twice the memory of the inputs. Plus it will copy every bit of data, so it's slow, too.
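
To put the copying cost in perspective, here is a rough back-of-the-envelope sketch; the file count and per-file size are assumptions (the question only says "several hundred arrays" totaling about 25 GB), not figures from the post:

    # Assumed: ~500 files of ~0.05 GB each (roughly 25 GB in total).
    # Growing the result with np.concatenate re-copies everything accumulated
    # so far on every iteration, so the total data moved is quadratic.
    n_files = 500
    chunk_gb = 25 / n_files
    total_copied_gb = sum(i * chunk_gb for i in range(1, n_files + 1))
    print(f"data copied by repeated concatenate: ~{total_copied_gb:,.0f} GB")  # roughly 6 TB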

  1. Use numpy.memmap to load the arrays. That will use only a few bytes of memory while still being pretty efficient.

  2. Join them using np.vstack. Call this only once (i.e. don't do bigArray = vstack(bigArray, newArray)!!!). Load all the arrays into a list allArrays and then call bigArray = vstack(allArrays) (see the first sketch below).

  3. If that is really too slow, you need to know the size of the array in advance, create an array of this size once, and then load the data into the existing array instead of creating a new one every time (see the second sketch below).

Depending on how often the files on disk change, it might be much more efficient to concatenate them with the OS tools to create one huge file and then load that (or use numpy.memmap).
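
A minimal sketch of points 1 and 2, assuming the files are .npy arrays saved under /home/extra/AllData (the directory comes from the question; if the files are raw binary instead, numpy.memmap with an explicit dtype and shape would replace np.load(..., mmap_mode='r')):

    import os
    import numpy as np

    data_dir = '/home/extra/AllData'
    filenames = sorted(f for f in os.listdir(data_dir) if f.endswith('.npy'))

    # Memory-map every file: almost no RAM is used until the data is actually read.
    allArrays = [np.load(os.path.join(data_dir, f), mmap_mode='r') for f in filenames]

    # A single vstack at the end, instead of growing the result inside a loop.
    bigArray = np.vstack(allArrays)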

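And a minimal sketch of point 3, under the assumption that every file holds a 2-D array with the same number of columns (200, as in the question) and that the shapes can be read up front:

    import os
    import numpy as np

    data_dir = '/home/extra/AllData'
    filenames = sorted(f for f in os.listdir(data_dir) if f.endswith('.npy'))

    # First pass: read only the shapes (memory-mapped, so nothing is loaded yet).
    shapes = [np.load(os.path.join(data_dir, f), mmap_mode='r').shape for f in filenames]
    total_rows = sum(s[0] for s in shapes)
    n_cols = shapes[0][1]

    # Allocate the result once, then fill it in place, one file at a time.
    bigArray = np.empty((total_rows, n_cols))
    row = 0
    for f, shape in zip(filenames, shapes):
        bigArray[row:row + shape[0]] = np.load(os.path.join(data_dir, f))
        row += shape[0]
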
Aaron Digulla answered Oct 30 '22 at 22:10