
Pandas: merge data frame but summing overlapping columns

Tags:

python

pandas

I've been reading lots of posts about the merge() and join() methods of pandas.DataFrames, and have tried them on my own problem, but haven't quite found a solution.

I have a very large data file (.csv) containing the hourly consumption of something for various IDs. I want to aggregate the consumption for each ID over each month.

Due to memory limitations I need to process the hourly consumption file with read_csv in chunks (using the chunksize option), and I end up with a load of DataFrames of consumption for IDs over some months, e.g.:

df1 = 
 Month  Dec  Nov
ID             
XXX    4.0  1.0
YYY    8.0  3.0
ZZZ    4.0  1.0 

df2 = 
 Month  Dec  Nov  Oct
ID                  
AAA    1.0  7.0  9.0
BBB    0.0  NaN  2.0
YYY    5.0  5.0  0.0

Generated for this post by:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': ['XXX','XXX','YYY','YYY','ZZZ','ZZZ'], 
                    'Month': ['Nov','Dec']*3, 
                    'Consumption': [1.0,4.0,3.0,8.0,1.0,4.0]})
df1 = df1.pivot(index='ID', columns='Month', values='Consumption')
df2 = pd.DataFrame({'ID': ['AAA','AAA','AAA','YYY','YYY','YYY','BBB','BBB','BBB'], 
                    'Month': ['Oct','Nov','Dec']*3, 
                    'Consumption': [9,7,1,0,5,5,2,np.nan,0]})
df2 = df2.pivot(index='ID', columns='Month', values='Consumption')
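(For context, the real chunk frames come from iterating read_csv; the file and column names below are assumptions, with a small in-memory CSV standing in for the actual file:)

```python
import io
import pandas as pd

# Stand-in for the real file: same assumed layout as the hourly
# consumption CSV -- one row per ID/Month/Consumption reading.
csv_data = io.StringIO(
    "ID,Month,Consumption\n"
    "XXX,Nov,1.0\n"
    "XXX,Dec,4.0\n"
    "YYY,Nov,3.0\n"
    "YYY,Dec,8.0\n"
)

frames = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Pivot each chunk to one column per month, like df1/df2 above.
    frames.append(chunk.pivot(index='ID', columns='Month', values='Consumption'))
```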

Note that there is a difference between a consumption of 0.0 and NaN. 0.0 means there was at least one consumption reading of 0.0 in the month, but NaN means there was no consumption value logged at all and, in this case, 0 cannot be assumed. For my purposes this difference must be kept distinct.

Because the data file is processed in chunks, there are some IDs that appear in more than one DataFrame, e.g. YYY, and, for those IDs, sometimes the months overlap too, e.g. Nov for ID YYY. In this case, consumption in the first half of the month is in df1 and the second half is in df2.

To aggregate the consumption I therefore need to merge these DataFrames by 'ID' and sum values in overlapping 'Months'.

Straight summing the DataFrames yields many NaNs:

df1 + df2 = 
 Month   Dec  Nov  Oct
ID                   
AAA     NaN  NaN  NaN
BBB     NaN  NaN  NaN
XXX     NaN  NaN  NaN
YYY    13.0  8.0  NaN
ZZZ     NaN  NaN  NaN

I assume this is because summing IDs/Months of df1 that don't appear in df2 returns NaN.

Outer merging produces the suffixed columns for overlapping months:

df1.merge(df2,how='outer',on='ID') = 
 Month  Dec_x  Nov_x  Dec_y  Nov_y  Oct
ID                                    
XXX      4.0    1.0    NaN    NaN  NaN
YYY      8.0    3.0    5.0    5.0  0.0
ZZZ      4.0    1.0    NaN    NaN  NaN
AAA      NaN    NaN    1.0    7.0  9.0
BBB      NaN    NaN    0.0    NaN  2.0

I couldn't get combine_first to do what I want either.
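(For reference, combine_first fails here because it takes the first non-null value per cell rather than summing, as a quick check with the example frames shows:)

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': ['XXX','XXX','YYY','YYY','ZZZ','ZZZ'],
                    'Month': ['Nov','Dec']*3,
                    'Consumption': [1.0,4.0,3.0,8.0,1.0,4.0]}
                   ).pivot(index='ID', columns='Month', values='Consumption')
df2 = pd.DataFrame({'ID': ['AAA','AAA','AAA','YYY','YYY','YYY','BBB','BBB','BBB'],
                    'Month': ['Oct','Nov','Dec']*3,
                    'Consumption': [9,7,1,0,5,5,2,np.nan,0]}
                   ).pivot(index='ID', columns='Month', values='Consumption')

# combine_first patches df1's holes with df2's values; on overlapping
# cells df1's value wins, so nothing is summed.
combined = df1.combine_first(df2)
print(combined.loc['YYY', 'Nov'])  # 3.0 from df1, not the 8.0 sum we want
```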

What I want is something in the middle, that looks like this:

 Month   Dec  Nov  Oct
ID                   
XXX      4.0  1.0  NaN
YYY     13.0  8.0  0.0
ZZZ      4.0  1.0  NaN
AAA      1.0  7.0  9.0
BBB      0.0  NaN  2.0

Where overlapping months are summed such that x + NaN = x, NaN + y = y and NaN + NaN = NaN.
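(That NaN arithmetic is exactly what DataFrame.add with fill_value=0 gives for a pair of frames: the fill value is substituted only when one side is missing, so NaN + NaN stays NaN. A sketch using the example frames; it only handles two frames at a time:)

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': ['XXX','XXX','YYY','YYY','ZZZ','ZZZ'],
                    'Month': ['Nov','Dec']*3,
                    'Consumption': [1.0,4.0,3.0,8.0,1.0,4.0]}
                   ).pivot(index='ID', columns='Month', values='Consumption')
df2 = pd.DataFrame({'ID': ['AAA','AAA','AAA','YYY','YYY','YYY','BBB','BBB','BBB'],
                    'Month': ['Oct','Nov','Dec']*3,
                    'Consumption': [9,7,1,0,5,5,2,np.nan,0]}
                   ).pivot(index='ID', columns='Month', values='Consumption')

# Aligns on the union of IDs and months; x + NaN = x, NaN + NaN = NaN.
result = df1.add(df2, fill_value=0)
```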


One solution I can see is to do the merge, then sum the overlapping columns, ignoring NaNs:

df3 = df1.merge(df2, how='outer', on='ID', suffixes=['', '_x'])
overlapping_months_sufx = df3.columns[df3.columns.str.endswith('_x')]
for mnth_sufx in overlapping_months_sufx:
    mnth = mnth_sufx[:-2]
    # use .loc rather than chained indexing to avoid SettingWithCopyWarning
    df3.loc[df3[mnth_sufx].notnull(), mnth] = df3[mnth].fillna(0) + df3[mnth_sufx]
    df3 = df3.drop(columns=mnth_sufx)
df3 = 
 Month   Dec  Nov  Oct
ID                   
XXX     4.0  1.0  NaN
YYY    13.0  8.0  0.0
ZZZ     4.0  1.0  NaN
AAA     1.0  7.0  9.0
BBB     0.0  NaN  2.0

Given the size of this data set it would be great to have the most efficient way to aggregate all this. Is there a better way to do it, perhaps in a single step?

Thanks, Chris

asked Dec 11 '18 by Chris Browne


2 Answers

Here's an attempt. Please leave a comment if I understood correctly.

Given:

>>> df1                                                                                                                
Month  Dec  Nov
ID             
XXX    4.0  1.0
YYY    8.0  3.0
ZZZ    4.0  1.0
>>> df2                                                                                                                
Month  Dec  Nov  Oct
ID                  
AAA    1.0  7.0  9.0
BBB    0.0  NaN  2.0
YYY    5.0  5.0  0.0

Solution:

>>> pd.concat([df1, df2]).reset_index().groupby('ID', sort=False).sum(min_count=1)
      Dec  Nov  Oct
ID                 
XXX   4.0  1.0  NaN
YYY  13.0  8.0  0.0
ZZZ   4.0  1.0  NaN
AAA   1.0  7.0  9.0
BBB   0.0  NaN  2.0

Explanation:

Concatenation just puts df2 under df1.

>>> cat = pd.concat([df1, df2])                                                                                        
>>> cat                                                                                                                
     Dec  Nov  Oct
ID                
XXX  4.0  1.0  NaN
YYY  8.0  3.0  NaN
ZZZ  4.0  1.0  NaN
AAA  1.0  7.0  9.0
BBB  0.0  NaN  2.0
YYY  5.0  5.0  0.0

reset_index moves the index into the columns.

>>> cat = cat.reset_index()                                                                                            
>>> cat                                                                                                                
    ID  Dec  Nov  Oct
0  XXX  4.0  1.0  NaN
1  YYY  8.0  3.0  NaN
2  ZZZ  4.0  1.0  NaN
3  AAA  1.0  7.0  9.0
4  BBB  0.0  NaN  2.0
5  YYY  5.0  5.0  0.0

I did this so that there is a column named 'ID' by which I can group the other values. groupby('ID', sort=False) creates groups of rows sharing the same value in the 'ID' column; sort=False keeps the groups in order of first appearance, so the result matches your desired output.

We can check the group sizes like this:

>>> cat.groupby('ID', sort=False).size()                                                                               
ID
XXX    1
YYY    2
ZZZ    1
AAA    1
BBB    1
dtype: int64

As you can see, there is only one group of size two, because 'YYY' is the only duplicated ID.

sum(min_count=1) works like this: within each group the values are summed column by column. The parameter min_count=1 requires at least one non-NaN value for the sum to be valid, so an all-NaN series sums to NaN instead of 0.0.

>>> cat.groupby('ID', sort=False).sum(min_count=1)                                                      
      Dec  Nov  Oct
ID                 
XXX   4.0  1.0  NaN
YYY  13.0  8.0  0.0
ZZZ   4.0  1.0  NaN
AAA   1.0  7.0  9.0
BBB   0.0  NaN  2.0

Demo for min_count:

>>> s = pd.Series([np.nan, np.nan])                                                                                    
>>> s                                                                                                                  
0   NaN
1   NaN
dtype: float64
>>>                                                                                                                    
>>> s.sum()                                                                                                            
0.0
>>> s.sum(min_count=1)                                                                                                 
nan
>>> s[0] = 1                                                                                                           
>>> s                                                                                                                  
0    1.0
1    NaN
dtype: float64
>>> s.sum()                                                                                                            
1.0
>>> s.sum(min_count=1)                                                                                                 
1.0
>>> s.sum(min_count=2)                                                                                                 
nan
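Since the real workload produces many chunk DataFrames rather than just two, the same pattern takes a whole list of frames in one go; grouping by the index level also avoids the reset_index step (the toy frames below are illustrative stand-ins for the pivoted chunks):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the pivoted chunk frames.
chunk_frames = [
    pd.DataFrame({'Dec': [4.0], 'Nov': [1.0]},
                 index=pd.Index(['XXX'], name='ID')),
    pd.DataFrame({'Dec': [8.0], 'Nov': [3.0]},
                 index=pd.Index(['YYY'], name='ID')),
    pd.DataFrame({'Dec': [5.0], 'Nov': [5.0], 'Oct': [0.0]},
                 index=pd.Index(['YYY'], name='ID')),
]

# Stack all chunks, then sum per ID; min_count=1 keeps all-NaN
# month/ID combinations as NaN instead of 0.0.
total = (pd.concat(chunk_frames)
           .groupby(level='ID', sort=False)
           .sum(min_count=1))
```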
answered Sep 29 '22 by timgeb


My approach would be to first sum the frames, then insert the values that reside in only one DataFrame at the positions where the result of the sum is NaN:

result = df1 + df2

Month   Dec  Nov  Oct
ID                   
AAA     NaN  NaN  NaN
BBB     NaN  NaN  NaN
XXX     NaN  NaN  NaN
YYY    13.0  8.0  NaN
ZZZ     NaN  NaN  NaN

result = result.where(~result.isna(), df1)

Month   Dec  Nov  Oct
ID                   
AAA     NaN  NaN  NaN
BBB     NaN  NaN  NaN
XXX     4.0  1.0  NaN
YYY    13.0  8.0  NaN
ZZZ     4.0  1.0  NaN

result = result.where(~result.isna(), df2)

Month   Dec  Nov  Oct
ID                   
AAA     1.0  7.0  9.0
BBB     0.0  NaN  2.0
XXX     4.0  1.0  NaN
YYY    13.0  8.0  0.0
ZZZ     4.0  1.0  NaN
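The three steps can also be chained in one expression, since where(~result.isna(), other) is equivalent to fillna(other) (a sketch, reconstructing the question's frames):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': ['XXX','XXX','YYY','YYY','ZZZ','ZZZ'],
                    'Month': ['Nov','Dec']*3,
                    'Consumption': [1.0,4.0,3.0,8.0,1.0,4.0]}
                   ).pivot(index='ID', columns='Month', values='Consumption')
df2 = pd.DataFrame({'ID': ['AAA','AAA','AAA','YYY','YYY','YYY','BBB','BBB','BBB'],
                    'Month': ['Oct','Nov','Dec']*3,
                    'Consumption': [9,7,1,0,5,5,2,np.nan,0]}
                   ).pivot(index='ID', columns='Month', values='Consumption')

# Sum where both frames have a value, then patch the remaining NaNs
# first from df1, then from df2.
result = (df1 + df2).fillna(df1).fillna(df2)
```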
answered Sep 29 '22 by SpghttCd