
Why is Pandas Concatenation (pandas.concat) so Memory Inefficient?

I have about 30 GB of data (in a list of about 900 dataframes) that I am attempting to concatenate together. The machine I am working with is a moderately powerful Linux box with about 256 GB of RAM. However, when I try to concatenate my files I quickly run out of available RAM. I have tried all sorts of workarounds (concatenating in smaller batches with for loops, etc.), but I still cannot get these to concatenate. Two questions spring to mind:

  1. Has anyone else dealt with this and found an effective workaround? I cannot use a straight append because I need the 'column merging' (for lack of a better word) functionality of the join='outer' argument in pd.concat().

  2. Why is Pandas concatenation (which I know is just calling numpy.concatenate) so inefficient with its use of memory?

I should also note that I do not think the problem is an explosion of columns: concatenating 100 of the dataframes together gives about 3,000 columns, whereas the base dataframe has about 1,000.

Edit:

The data I am working with is financial data, about 1,000 columns wide and about 50,000 rows deep, for each of my 900 dataframes. The column types, going left to right, are:

  1. date in string format
  2. string
  3. np.float
  4. int

... and so on, repeating. I am concatenating on column name with an outer join, which means that any columns in df2 that are not in df1 are not discarded but are kept as extra columns, filled with NaN for the rows that lack them.
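A minimal toy illustration of what I mean (made-up frames, not my real data):

 import pandas as pd

 df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
 df2 = pd.DataFrame({'b': [5], 'c': [6]})

 # join='outer' keeps the union of columns; missing entries become NaN
 print pd.concat([df1, df2], join='outer', ignore_index=True)
 # roughly:
 #      a  b    c
 # 0  1.0  3  NaN
 # 1  2.0  4  NaN
 # 2  NaN  5  6.0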


Example:

 # example code
 data = pd.concat(datalist4, join="outer", axis=0, ignore_index=True)

 # two example dataframes follow (about 90% of the column names are shared
 # between the two dataframes; the unnamed columns, etc. are not a significant
 # fraction of the columns)

print datalist4[0].head()
                800_1     800_2   800_3  800_4               900_1     900_2
0 2014-08-06 09:00:00  BEST_BID  1117.1    103 2014-08-06 09:00:00  BEST_BID
1 2014-08-06 09:00:00  BEST_ASK  1120.0    103 2014-08-06 09:00:00  BEST_ASK   
2 2014-08-06 09:00:00  BEST_BID  1106.9     11 2014-08-06 09:00:00  BEST_BID   
3 2014-08-06 09:00:00  BEST_ASK  1125.8     62 2014-08-06 09:00:00  BEST_ASK   
4 2014-08-06 09:00:00  BEST_BID  1117.1    103 2014-08-06 09:00:00  BEST_BID   

    900_3  900_4              1000_1    1000_2    ...     2400_4
0  1017.2    103 2014-08-06 09:00:00  BEST_BID    ...        NaN
1  1020.1    103 2014-08-06 09:00:00  BEST_ASK    ...        NaN   
2  1004.3     11 2014-08-06 09:00:00  BEST_BID    ...        NaN   
3  1022.9     11 2014-08-06 09:00:00  BEST_ASK    ...        NaN   
4  1006.7     10 2014-08-06 09:00:00  BEST_BID    ...        NaN   

                      _1  _2  _3  _4                   _1.1 _2.1 _3.1  _4.1
0  #N/A Invalid Security NaN NaN NaN  #N/A Invalid Security  NaN  NaN   NaN
1                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   
2                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   
3                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   
4                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN   

      dater  
0  2014.8.6  
1  2014.8.6  
2  2014.8.6  
3  2014.8.6  
4  2014.8.6  

[5 rows x 777 columns]

print datalist4[1].head()
                150_1     150_2   150_3  150_4               200_1     200_2
0 2013-12-04 09:00:00  BEST_BID  1639.6     30 2013-12-04 09:00:00  BEST_ASK
1 2013-12-04 09:00:00  BEST_ASK  1641.8    133 2013-12-04 09:00:08  BEST_BID   
2 2013-12-04 09:00:01  BEST_BID  1639.5     30 2013-12-04 09:00:08  BEST_ASK   
3 2013-12-04 09:00:05  BEST_BID  1639.4     30 2013-12-04 09:00:08  BEST_ASK   
4 2013-12-04 09:00:08  BEST_BID  1639.3    133 2013-12-04 09:00:08  BEST_BID   

    200_3  200_4               250_1     250_2    ...                 2500_1
0  1591.9    133 2013-12-04 09:00:00  BEST_BID    ...    2013-12-04 10:29:41
1  1589.4     30 2013-12-04 09:00:00  BEST_ASK    ...    2013-12-04 11:59:22   
2  1591.6    103 2013-12-04 09:00:01  BEST_BID    ...    2013-12-04 11:59:23   
3  1591.6    133 2013-12-04 09:00:04  BEST_BID    ...    2013-12-04 11:59:26   
4  1589.4    133 2013-12-04 09:00:07  BEST_BID    ...    2013-12-04 11:59:29   

     2500_2 2500_3 2500_4         Unnamed: 844_1  Unnamed: 844_2
0  BEST_ASK   0.35     50  #N/A Invalid Security             NaN
1  BEST_ASK   0.35     11                    NaN             NaN   
2  BEST_ASK   0.40     11                    NaN             NaN   
3  BEST_ASK   0.45     11                    NaN             NaN   
4  BEST_ASK   0.50     21                    NaN             NaN   

  Unnamed: 844_3 Unnamed: 844_4         Unnamed: 848_1      dater  
0            NaN            NaN  #N/A Invalid Security  2013.12.4  
1            NaN            NaN                    NaN  2013.12.4  
2            NaN            NaN                    NaN  2013.12.4  
3            NaN            NaN                    NaN  2013.12.4  
4            NaN            NaN                    NaN  2013.12.4  

[5 rows x 850 columns]
asked Apr 22 '15 by sfortney


2 Answers

I've had performance issues when concatenating a large number of DataFrames onto a 'growing' DataFrame. My workaround was to append all the sub-DataFrames to a list, and then concatenate the list of DataFrames once processing of the sub-DataFrames was complete.
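A minimal sketch of the two patterns (make_piece here is just a hypothetical stand-in for whatever produces each sub-DataFrame):

import pandas as pd
import numpy as np

def make_piece(i):
    # stand-in for whatever produces each sub-DataFrame
    return pd.DataFrame(np.random.randn(1000, 10))

# slow: every iteration copies the entire accumulated frame again
grown = pd.DataFrame()
for i in range(100):
    grown = pd.concat([grown, make_piece(i)], ignore_index=True)

# better: collect the pieces in a list, then pay for one copy at the end
pieces = [make_piece(i) for i in range(100)]
result = pd.concat(pieces, ignore_index=True)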

answered Dec 08 '22 by Alexander


Looks like you are trying to concatenate row-wise, even though your text indicates that you want column-wise. Specify axis=1.

Other points to consider:

The copy=False flag will not help at all; it only matters if you are not concatenating blocks of the same dtype (which you indicated you are).

pd.concat does use np.concatenate under the hood. If you think you can do better, then go for it.
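For what the raw numpy route looks like, here is a sketch under the assumption that every frame shares identical columns and a single float dtype (which is not true of your mixed data, so this is illustrative only; frames is a hypothetical list of inputs):

import numpy as np
import pandas as pd

# hypothetical list of frames with identical columns and one float dtype
frames = [pd.DataFrame(np.random.randn(1000, 5), columns=list('abcde'))
          for _ in range(10)]

# .values is a view for a single-dtype frame; np.concatenate makes one
# allocation for the result and copies each block into it once
stacked = np.concatenate([f.values for f in frames], axis=0)
result = pd.DataFrame(stacked, columns=frames[0].columns)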

def make_frames(n=100, rows=100, cols=100):
    # each frame gets 100 random column labels from range(110), so columns mostly overlap
    return [pd.DataFrame(np.random.randn(rows, cols), columns=np.random.choice(110, 100, replace=False))
            for i in xrange(n)]

In [28]: l = make_frames(rows=10000)

In [29]: l[0].head()
Out[29]: 
        60        75        101       103       87        29        10        106       71        26        30        83        2         28        99        85        88        62        58        18        42        1         105       25        34     ...          102       27        22   \
0 -0.854117 -0.007549 -0.510359 -0.993757  0.877635 -0.303199 -1.488548  1.179360  0.578095  0.807792  0.169930 -1.781403  0.204696 -0.515057 -0.954246  1.106073  0.666516 -1.146988  1.335709  0.362838 -0.675379  1.483469  0.670385 -0.483312 -0.703795    ...     1.322645 -1.942183  1.053502   
1  2.057542  0.860946 -0.037665 -0.347265  0.152562 -0.859537  1.431045  1.306419  0.623013  1.192325  0.909597  1.710507  1.319330 -0.402874  1.749581  1.223489  0.036354  0.140255  0.844330 -0.091447 -0.347245  0.259055  1.187882 -0.216858 -1.421336    ...     1.122068  0.887538  0.205854   
2 -0.077974  0.947503  0.688666  0.288104 -1.275329 -0.840847 -2.014090 -1.318507 -0.889416 -0.098005  0.055492  0.847597 -1.289428 -0.910093  0.201312 -1.699879  0.103062 -1.041608  0.379171 -1.089937  0.894626 -1.500215 -0.501182  0.042078 -0.840789    ...     0.539192  0.193256  0.196138   
3  0.291993  1.138577  1.061509  0.856553  1.118931  0.725806 -0.689776  1.337957 -1.009835 -0.976506 -0.392317  0.295876  0.092240  0.418201  0.473585  0.013809 -1.169947  0.424797  0.019051 -0.526189  0.066991 -0.268750  1.277004 -0.736560 -0.314987    ...     0.272045 -0.333272  0.573267   
4 -2.073985 -0.016950 -1.712770  0.286212 -0.159693 -0.495864  1.286450 -1.168880  1.031456 -3.080568  1.443880 -0.604405  0.406383 -0.162986  1.077255  1.160726  0.943949 -1.517681 -1.049972  1.208850 -0.859617 -0.145358 -0.638898  0.248012 -2.985845    ...    -0.699697  0.051352  0.575304   

        69        76        91        45        14        37        0         81        38        72        107       11        5         73        70        8         90        94        53        3         55        12   
0 -0.972965 -0.298674  1.283482  2.344092 -0.597735 -0.407978  0.971726 -0.935620  0.236889 -0.957096 -2.366399 -0.943760  0.293325 -0.240385 -0.392554 -0.887556  0.261402 -2.050122 -1.776865 -1.513899 -0.953916  0.630495  
1 -1.471033  0.269830 -0.744507 -0.982779  0.624527 -1.782704  1.197262 -0.297730  1.122939 -1.039226  0.171351 -0.828985  0.698245  0.563430  0.718177  0.682369  1.415918  0.049931  0.648000  1.785455 -0.190021 -1.329753  
2 -1.942792  0.560981 -0.353782 -1.637407 -1.495131 -0.593041 -1.617116 -0.910257 -0.506877  0.178378 -0.623986  0.302544  0.279309 -0.266409  0.780306  0.986510 -1.549847  0.063632 -0.480434  1.393221 -1.237682  1.577320  
3  0.468151 -1.002872 -0.147329 -0.420609  0.183696  0.527632  0.018911 -2.059989  1.642613 -0.428345  1.350693 -1.323321 -0.247263  0.331525 -2.036862 -2.593575  0.362101 -0.184095  0.419231 -0.633878  0.097499 -0.026044  
4 -0.581330 -0.848421 -0.682027 -1.260004 -0.357354 -0.304743  0.409537 -1.189925 -0.609352 -0.610345 -0.798009  0.219822 -0.681764  1.872736  1.738017  0.439148  1.012881 -0.934613 -1.007427 -0.390359  0.329949  0.486906  

[5 rows x 100 columns]

Concat, using axis=1 as this is a column-wise concat:

In [31]: df = pd.concat(l,axis=1,ignore_index=True)

In [32]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Columns: 10000 entries, 0 to 9999
dtypes: float64(10000)
memory usage: 763.0 MB
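That 763.0 MB is exactly the raw payload: 10,000 rows × 10,000 float64 columns × 8 bytes = 800,000,000 bytes, and 800e6 / 1024² ≈ 763. The result itself carries no overhead; the extra memory shows up only transiently while the concat runs, as the %memit peak below shows.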

Timings

In [33]: %timeit pd.concat(l,axis=1,ignore_index=True)
1 loops, best of 3: 1.15 s per loop

In [34]: %memit pd.concat(l,axis=1,ignore_index=True)
peak memory: 2390.25 MiB, increment: 651.28 MiB
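(Note: %memit is not built into IPython; it comes from the memory_profiler package, loaded earlier in the session with %load_ext memory_profiler.)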
answered Dec 08 '22 by Jeff