
efficient array concatenation

I'm trying to concatenate several hundred arrays totaling almost 25 GB of data. I am testing on a 56 GB machine, but I receive a memory error. I reckon the way I do my processing is inefficient and is eating up lots of memory. This is my code:

    import os
    import numpy

    # note: after the loop, dirname/filenames refer to the last directory os.walk visited
    for dirname, dirnames, filenames in os.walk('/home/extra/AllData'):
        filenames.sort()

    BigArray = numpy.zeros((1, 200))
    for filename in filenames:
        newArray = numpy.load(os.path.join(dirname, filename))
        BigArray = numpy.concatenate((BigArray, newArray))

Any ideas, thoughts or solutions?

Thanks

Adham Ghazali asked Oct 21 '22 at 21:10


1 Answer

Your process is horribly inefficient. When handling such huge amounts of data, you really need to know your tools.

For your problem, np.concatenate is forbidden - it needs at least twice the memory of the inputs. Plus it will copy every bit of data, so it's slow, too.
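
To put the copying cost in perspective, here is a rough back-of-the-envelope sketch; the file count and per-file size are assumptions (the question only says "several hundred arrays" totaling about 25 GB), not figures from the post:

    # Assumed: ~500 files of ~0.05 GB each (roughly 25 GB in total).
    # Growing the result with np.concatenate re-copies everything accumulated
    # so far on every iteration, so the total data moved is quadratic.
    n_files = 500
    chunk_gb = 25 / n_files
    total_copied_gb = sum(i * chunk_gb for i in range(1, n_files + 1))
    print(f"data copied by repeated concatenate: ~{total_copied_gb:,.0f} GB")  # roughly 6 TB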

  1. Use numpy.memmap to load the arrays. That will use only a few bytes of memory while still being pretty efficient.

  2. Join them using np.vstack. Call this only once (i.e. don't do bigArray = vstack(bigArray, newArray)!!!). Load all the arrays into a list allArrays and then call bigArray = vstack(allArrays) (see the first sketch below).

  3. If that is really too slow, you need to know the size of the array in advance, create an array of this size once, and then load the data into the existing array instead of creating a new one every time (see the second sketch below).

Depending on how often the files on disk change, it might be much more efficient to concatenate them with the OS tools to create one huge file and then load that (or use numpy.memmap).
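
A minimal sketch of points 1 and 2, assuming the files are .npy arrays saved under /home/extra/AllData (the directory comes from the question; if the files are raw binary instead, numpy.memmap with an explicit dtype and shape would replace np.load(..., mmap_mode='r')):

    import os
    import numpy as np

    data_dir = '/home/extra/AllData'
    filenames = sorted(f for f in os.listdir(data_dir) if f.endswith('.npy'))

    # Memory-map every file: almost no RAM is used until the data is actually read.
    allArrays = [np.load(os.path.join(data_dir, f), mmap_mode='r') for f in filenames]

    # A single vstack at the end, instead of growing the result inside a loop.
    bigArray = np.vstack(allArrays)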

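And a minimal sketch of point 3, under the assumption that every file holds a 2-D array with the same number of columns (200, as in the question) and that the shapes can be read up front:

    import os
    import numpy as np

    data_dir = '/home/extra/AllData'
    filenames = sorted(f for f in os.listdir(data_dir) if f.endswith('.npy'))

    # First pass: read only the shapes (memory-mapped, so nothing is loaded yet).
    shapes = [np.load(os.path.join(data_dir, f), mmap_mode='r').shape for f in filenames]
    total_rows = sum(s[0] for s in shapes)
    n_cols = shapes[0][1]

    # Allocate the result once, then fill it in place, one file at a time.
    bigArray = np.empty((total_rows, n_cols))
    row = 0
    for f, shape in zip(filenames, shapes):
        bigArray[row:row + shape[0]] = np.load(os.path.join(data_dir, f))
        row += shape[0]
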
Aaron Digulla answered Oct 30 '22 at 22:10