 

Using Pandas to create DataFrame with Series, resulting in memory error

I'm using the Pandas library for remote sensing time series analysis. Eventually I would like to save my DataFrame to CSV using chunk sizes, but I've run into a little issue. My code generates six NumPy arrays that I convert to Pandas Series. Each of these Series contains a lot of items:

>>> prcpSeries.shape
(12626172,)

I would like to combine these Series into a Pandas DataFrame (df) so I can save them chunk by chunk to a CSV file.

d = {'prcp': pd.Series(prcpSeries),
     'tmax': pd.Series(tmaxSeries),
     'tmin': pd.Series(tminSeries),
     'ndvi': pd.Series(ndviSeries),
     'lstm': pd.Series(lstmSeries),
     'evtm': pd.Series(evtmSeries)}

df = pd.DataFrame(d)
outFile = 'F:/data/output/run1/_' + str(i) + '.out'
df.to_csv(outFile, header=False, chunksize=1000)
d = None
df = None

But my code gets stuck at the following line, giving a MemoryError:

df = pd.DataFrame(d)

Any suggestions? Is it possible to fill the Pandas DataFrame chunk by chunk?

Mattijn asked Jun 18 '13 09:06

1 Answer

If you know each of these is the same length, then you can create the DataFrame directly from the first array and then append each remaining column:

df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
...
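A minimal, self-contained sketch of this column-by-column approach (the small arrays here are illustrative stand-ins for the asker's six large arrays):

```python
import numpy as np
import pandas as pd

# Illustrative arrays standing in for the asker's NumPy arrays
prcpSeries = np.array([1.0, 2.0, 3.0])
tmaxSeries = np.array([10.0, 11.0, 12.0])

# Build the DataFrame from the first array, then add each column in place;
# this avoids holding a dict of six Series plus the DataFrame at once
df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
```

Assigning columns one at a time means only one extra array is being copied at any moment, rather than all six via the dict constructor.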

Note: you can also use the to_frame method, which optionally takes a name argument (useful if the Series doesn't have one):

df = prcpSeries.to_frame(name='prcp')

However, if the arrays are of variable length, then this will lose some data (any rows beyond the length of prcpSeries). An alternative here is to create each as a DataFrame and then perform an outer join (using concat):

df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...

df = pd.concat([df1, df2, ...], join='outer', axis=1)

For example:

In [21]: dfA = pd.DataFrame([1,2], columns=['A'])

In [22]: dfB = pd.DataFrame([1], columns=['B'])

In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
   A   B
0  1   1
1  2 NaN
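Putting the pieces together, here is a self-contained sketch of the outer-join approach for variable-length arrays, followed by the chunked CSV write from the question (the arrays and the output filename are illustrative):

```python
import numpy as np
import pandas as pd

# Two arrays of different lengths, standing in for the asker's data
prcpSeries = np.array([1.0, 2.0, 3.0])
tmaxSeries = np.array([10.0, 11.0])

df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])

# Outer join on the index; the shorter column is padded with NaN
df = pd.concat([df1, df2], join='outer', axis=1)

# Write in chunks of 1000 rows to keep peak memory down
df.to_csv('out.csv', header=False, chunksize=1000)
```

The chunksize argument to to_csv controls how many rows are written at a time, which bounds the memory used during the write itself.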
Andy Hayden answered Sep 19 '22 14:09