 

Memory error with large data sets for pandas.concat and numpy.append

I am facing a problem where I have to generate large DataFrames in a loop (50 iterations, each computing two 2000 x 800 pandas DataFrames). I would like to keep the results in memory in a bigger DataFrame, or in a dictionary-like structure. When using pandas.concat, I get a memory error at some point in the loop. The same happens when using numpy.append to store the results in a dictionary of numpy arrays rather than in a DataFrame. In both cases, I still have several GB of available memory. Is this too much data for pandas or numpy to process? Are there more memory-efficient ways to store my data without saving it to disk?

As an example, the following script fails as soon as nbIds is greater than 376:

import pandas as pd
import numpy as np

nbIds = 376
dataids = range(nbIds)
dataCollection1 = []
dataCollection2 = []
for bs in range(50):
    # two 2000 x nbIds DataFrames of uniform random numbers per iteration
    newData1 = pd.DataFrame(
        np.random.uniform(size=2000 * len(dataids)).reshape(2000, len(dataids)))
    dataCollection1.append(newData1)
    newData2 = pd.DataFrame(
        np.random.uniform(size=2000 * len(dataids)).reshape(2000, len(dataids)))
    dataCollection2.append(newData2)
dataCollection1 = pd.concat(dataCollection1).reset_index(drop=True)
dataCollection2 = pd.concat(dataCollection2).reset_index(drop=True)
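
For scale, a rough back-of-the-envelope estimate of the raw footprint of the data above, assuming the default float64 dtype (8 bytes per value; the variable names below are purely illustrative):

perBlock = 2000 * 376 * 8             # one 2000 x 376 block: ~6 MB
perCollection = 50 * perBlock         # one collection of 50 blocks: ~300 MB
bothCollections = 2 * perCollection   # ~600 MB, before pd.concat makes its own copy of the data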

The code below fails when nbIds is 665 or higher:

import pandas as pd
import numpy as np

nbIds = 665
dataids = range(nbIds)
dataCollection1 = dict((i, np.array([])) for i in dataids)
dataCollection2 = dict((i, np.array([])) for i in dataids)
for bs in range(50):
    newData1 = pd.DataFrame(
        np.random.uniform(size=2000 * len(dataids)).reshape(2000, len(dataids)))
    newData2 = pd.DataFrame(
        np.random.uniform(size=2000 * len(dataids)).reshape(2000, len(dataids)))
    for i in dataids:
        # grow the per-id arrays column by column (np.append copies each time)
        dataCollection1[i] = np.append(dataCollection1[i], np.array(newData1[i]))
        dataCollection2[i] = np.append(dataCollection2[i], np.array(newData2[i]))

I do need to compute both DataFrames every time, and for each element i of dataids I need to obtain a pandas Series or a numpy array containing the 50 * 2000 numbers generated for i. Ideally, I should be able to run this with nbIds equal to 800 or more. Is there a straightforward way of doing this?
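
For example, with the pd.concat approach the end result I am after is just column access on the concatenated DataFrame (the two names on the left are purely illustrative):

seriesForId = dataCollection1[i]         # pandas Series of 50 * 2000 = 100000 values for id i
arrayForId = dataCollection1[i].values   # the same data as a numpy array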

I am using 32-bit Python 2.7.5, pandas 0.12.0 and numpy 1.7.1.

Thank you very much for your help!

Asked by Vidac on Oct 25 '13



1 Answer

This is essentially what you are doing. Note that, from a memory perspective, it doesn't make much difference whether you convert to DataFrames before or after the concatenation.

But you can convert the data to float32 to effectively halve your memory usage:

In [45]: np.concatenate([ np.random.uniform(size=2000 * 1000).astype('float32').reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[45]: 400000000

In [46]: np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[46]: 800000000

In [47]: DataFrame(np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]))
Out[47]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Columns: 1000 entries, 0 to 999
dtypes: float64(1000)
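
A minimal sketch of how that float32 suggestion could be applied to the loop from the question (it reuses the question's variable names; nbIds = 800 is the target size mentioned there, and whether it actually fits still depends on the roughly 2 GB address space available to a 32-bit process):

import pandas as pd
import numpy as np

nbIds = 800
dataids = range(nbIds)
dataCollection1 = []
for bs in range(50):
    # build each 2000 x nbIds block directly as float32: half the bytes of float64
    block = np.random.uniform(size=2000 * nbIds).astype('float32').reshape(2000, nbIds)
    dataCollection1.append(pd.DataFrame(block))
dataCollection1 = pd.concat(dataCollection1).reset_index(drop=True)
# column i of the result is the Series of 50 * 2000 float32 values for id i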
Answered by Jeff on Oct 04 '22