I am facing a problem where I have to generate large DataFrames in a loop (50 iterations, each computing two 2000 x 800 pandas DataFrames). I would like to keep the results in memory, either in one bigger DataFrame or in a dictionary-like structure. When using pandas.concat, I get a memory error at some point in the loop. The same happens when using numpy.append to store the results in a dictionary of numpy arrays rather than in a DataFrame. In both cases I still have several GB of available memory. Is this too much data for pandas or numpy to process? Are there more memory-efficient ways to store my data without saving it to disk?
As an example, the following script fails as soon as nbIds is greater than 376:
import pandas as pd
import numpy as np
nbIds = 376
dataids = range(nbIds)
dataCollection1 = []
dataCollection2 = []
for bs in range(50):
    newData1 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection1.append(newData1)
    newData2 = pd.DataFrame(np.reshape(np.random.uniform(size=2000 * len(dataids)),
                                       (2000, len(dataids))))
    dataCollection2.append(newData2)
dataCollection1 = pd.concat(dataCollection1).reset_index(drop=True)
dataCollection2 = pd.concat(dataCollection2).reset_index(drop=True)
The code below fails when nbIds is 665 or higher:
import pandas as pd
import numpy as np
nbIds = 665
dataids = range(nbIds)
dataCollection1 = dict((i, np.array([])) for i in dataids)
dataCollection2 = dict((i, np.array([])) for i in dataids)
for bs in range(50):
    newData1 = np.reshape(np.random.uniform(size=2000 * len(dataids)),
                          (2000, len(dataids)))
    newData1 = pd.DataFrame(newData1)
    newData2 = np.reshape(np.random.uniform(size=2000 * len(dataids)),
                          (2000, len(dataids)))
    newData2 = pd.DataFrame(newData2)
    for i in dataids:
        dataCollection1[i] = np.append(dataCollection1[i],
                                       np.array(newData1[i]))
        dataCollection2[i] = np.append(dataCollection2[i],
                                       np.array(newData2[i]))
I do need to compute both DataFrames every time, and for each element i of dataids I need to obtain a pandas Series or a numpy array containing the 50 * 2000 numbers generated for i. Ideally, I need to be able to run this with nbIds equal to 800 or more.
Is there a straightforward way of doing this?
I am using 32-bit Python 2.7.5, pandas 0.12.0 and numpy 1.7.1.
Thank you very much for your help!
Ways to optimize memory in pandas: downcast the data types. Converting int64 columns to smaller integer types and float64 columns to float32 reduces memory usage; where the smaller types can represent the data without compromise, this can cut memory consumption roughly in half.
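For example, a minimal sketch of downcasting with pd.to_numeric (the frame and column names are made up, and the downcast argument requires a newer pandas than the 0.12.0 mentioned in the question):
import pandas as pd
import numpy as np

# Hypothetical frame with the default 64-bit dtypes
df = pd.DataFrame({'counts': np.arange(1000, dtype='int64'),
                   'values': np.random.uniform(size=1000)})
print(df.memory_usage(deep=True).sum())

# Downcast each column to the smallest numeric dtype that still holds the data
df['counts'] = pd.to_numeric(df['counts'], downcast='integer')  # int64 -> int16 here
df['values'] = pd.to_numeric(df['values'], downcast='float')    # float64 -> float32

print(df.dtypes)
print(df.memory_usage(deep=True).sum())  # noticeably smaller than the original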
Appropriate Python set-up. The simplest but possibly least intuitive cause of a MemoryError has to do with your Python installation itself: if you have installed the 32-bit version of Python on a 64-bit system, the process has only a few GB of address space at most, regardless of how much RAM is installed.
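A quick way to check which build you are actually running (a small sketch that works on both Python 2 and 3):
import struct
import sys

# A 32-bit interpreter has 4-byte pointers and a small sys.maxsize
print(struct.calcsize('P') * 8)   # prints 32 or 64
print(sys.maxsize > 2**32)        # False on a 32-bit build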
Use efficient datatypes. The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as "low-cardinality" data). By using more efficient data types, you can store larger datasets in memory.
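As an illustration (the data is made up, and the category dtype appeared in pandas after the 0.12.0 used in the question), converting a low-cardinality text column to category:
import pandas as pd
import numpy as np

# Hypothetical low-cardinality column: 100,000 rows but only three distinct labels
labels = pd.Series(np.random.choice(['red', 'green', 'blue'], size=100000))

print(labels.memory_usage(deep=True))                     # object dtype: several MB
print(labels.astype('category').memory_usage(deep=True))  # category: a small fraction of that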
In one benchmark, concatenating multiple DataFrames with a single pandas.concat call is 50 times faster than building the result incrementally with DataFrame.append.
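The pattern behind that result is to collect the pieces in a list and concatenate once instead of growing a DataFrame inside the loop; a sketch with made-up sizes (DataFrame.append was removed in pandas 2.0, so the slow variant only runs on older versions):
import pandas as pd
import numpy as np

chunks = [pd.DataFrame(np.random.uniform(size=(2000, 100))) for _ in range(50)]

# Slow: each append copies everything accumulated so far
slow = pd.DataFrame()
for chunk in chunks:
    slow = slow.append(chunk, ignore_index=True)

# Fast: build the list first, then concatenate in a single pass
fast = pd.concat(chunks, ignore_index=True)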
This is essentially what you are doing. Note that from a memory perspective it doesn't make much difference whether you convert to DataFrames before or after concatenating.
But you can specify dtype='float32' to roughly halve your memory usage:
In [45]: np.concatenate([ np.random.uniform(size=2000 * 1000).astype('float32').reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[45]: 400000000
In [46]: np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]).nbytes
Out[46]: 800000000
In [47]: DataFrame(np.concatenate([ np.random.uniform(size=2000 * 1000).reshape(2000,1000) for i in xrange(50) ]))
Out[47]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Columns: 1000 entries, 0 to 999
dtypes: float64(1000)
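Applied to the set-up in the question, a sketch (my adaptation, not the exact code from the answer) that builds one collection in float32 and exposes each id's values as a Series:
import pandas as pd
import numpy as np

nbIds = 800
pieces = []
for bs in range(50):
    # Generate each block directly as float32 to halve the footprint
    block = np.random.uniform(size=(2000, nbIds)).astype('float32')
    pieces.append(pd.DataFrame(block))

collection = pd.concat(pieces, ignore_index=True)  # 100000 x 800 float32, ~320 MB
del pieces                                         # free the intermediate blocks

series_for_id_0 = collection[0]  # the 50 * 2000 values generated for id 0
Whether two such collections fit inside a 32-bit process still depends on address-space fragmentation, so moving to a 64-bit Python remains the more robust fix.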