I've been reading lots of posts about the merge() and join() methods of pandas DataFrames, and trying these on my own problem, but I haven't quite found a solution.
I have a very large data file (.csv) containing the hourly consumption of something for various IDs. I want to aggregate the consumption for each ID over each month.
Due to memory limitations I need to process the hourly consumption file with read_csv in chunks (using the chunksize option; a sketch of this loop follows the example data below), and I end up with a load of DataFrames of consumption for IDs for some months, e.g.:
df1 =
Month Dec Nov
ID
XXX 4.0 1.0
YYY 8.0 3.0
ZZZ 4.0 1.0
df2 =
Month Dec Nov Oct
ID
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
YYY 5.0 5.0 0.0
Generated for this post by:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'ID': ['XXX','XXX','YYY','YYY','ZZZ','ZZZ'],
                    'Month': ['Nov','Dec']*3,
                    'Consumption': [1.0,4.0,3.0,8.0,1.0,4.0]})
df1 = df1.pivot(index='ID', columns='Month', values='Consumption')
df2 = pd.DataFrame({'ID': ['AAA','AAA','AAA','YYY','YYY','YYY','BBB','BBB','BBB'],
                    'Month': ['Oct','Nov','Dec']*3,
                    'Consumption': [9,7,1,0,5,5,2,np.nan,0]})
df2 = df2.pivot(index='ID', columns='Month', values='Consumption')
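For context, the chunked processing that produces these DataFrames looks roughly like this (a sketch only; 'hourly.csv', the chunk size and the Timestamp/Consumption column names are placeholders, not my actual setup):
frames = []
for chunk in pd.read_csv('hourly.csv', parse_dates=['Timestamp'],
                         chunksize=1_000_000):
    # collapse the chunk's hourly rows to one row per (ID, Month)
    chunk['Month'] = chunk['Timestamp'].dt.strftime('%b')
    monthly = (chunk.groupby(['ID', 'Month'])['Consumption']
                    .sum(min_count=1)  # months with only NaN readings stay NaN
                    .unstack('Month'))
    frames.append(monthly)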
Note that there is a difference between a consumption of 0.0 and NaN. 0.0 means there was at least one consumption reading of 0.0 in the month, but NaN means there was no consumption value logged at all and, in this case, 0 cannot be assumed. For my purposes this distinction must be preserved.
Because the data file is processed in chunks, there are some IDs that appear in more than one DataFrame, e.g. YYY, and, for those IDs, the months sometimes overlap too, e.g. Nov for ID YYY. In this case, consumption for the first half of the month is in df1 and for the second half in df2.
To aggregate the consumption I therefore need to merge these DataFrames by 'ID' and sum values in overlapping 'Months'.
Straight summing the DataFrames yields many NaNs:
df1 + df2 =
Month Dec Nov Oct
ID
AAA NaN NaN NaN
BBB NaN NaN NaN
XXX NaN NaN NaN
YYY 13.0 8.0 NaN
ZZZ NaN NaN NaN
I assume this is because + aligns the two DataFrames on both index and columns, so any ID/Month that appears in only one of df1 and df2 is paired with NaN, and NaN + x gives NaN.
Outer merging produces the suffixed columns for overlapping months:
df1.merge(df2,how='outer',on='ID') =
Month Dec_x Nov_x Dec_y Nov_y Oct
ID
XXX 4.0 1.0 NaN NaN NaN
YYY 8.0 3.0 5.0 5.0 0.0
ZZZ 4.0 1.0 NaN NaN NaN
AAA NaN NaN 1.0 7.0 9.0
BBB NaN NaN 0.0 NaN 2.0
I couldn't get combine_first to do what I want either.
What I want is something in the middle, that looks like this:
Month Dec Nov Oct
ID
XXX 4.0 1.0 NaN
YYY 13.0 8.0 0.0
ZZZ 4.0 1.0 NaN
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
Where overlapping months are summed such that x + NaN = x, NaN + y = y and NaN + NaN = NaN.
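As far as I can tell these are exactly the semantics of pandas' add with fill_value=0 on two aligned Series; a small demo with made-up values:
a = pd.Series([1.0, np.nan, np.nan])
b = pd.Series([np.nan, 2.0, np.nan])
print(a.add(b, fill_value=0))
# 0    1.0   <- x + NaN = x
# 1    2.0   <- NaN + y = y
# 2    NaN   <- NaN + NaN = NaN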
One solution I can see is to do the merge and then sum the overlapping columns, ignoring NaNs:
df3 = df1.merge(df2, how='outer', on='ID', suffixes=['', '_x'])
# columns of df2 that clash with a df1 month get the '_x' suffix
overlapping_months_sufx = df3.columns[df3.columns.str.endswith('_x')]
for mnth_sufx in overlapping_months_sufx:
    mnth = mnth_sufx[:-2]  # strip '_x' to recover the month name
    # where df2 logged a value, add it to df1's value, treating df1's NaN
    # as 0; .loc avoids the chained-assignment pitfall of df3[mnth][mask] = ...
    mask = df3[mnth_sufx].notnull()
    df3.loc[mask, mnth] = df3[mnth].fillna(0) + df3[mnth_sufx]
    df3 = df3.drop(columns=mnth_sufx)
df3 =
Month Dec Nov Oct
ID
XXX 4.0 1.0 NaN
YYY 13.0 8.0 0.0
ZZZ 4.0 1.0 NaN
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
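Incidentally, the masked assignment inside the loop can be collapsed into a single line with add and fill_value=0, which implements exactly the NaN rules above:
df3[mnth] = df3[mnth].add(df3[mnth_sufx], fill_value=0)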
Given the size of the data set, it would be great to have the most efficient way to aggregate all this. Is there a better way to do it, perhaps in a single step?
Thanks, Chris
Here's an attempt. Please leave a comment if I understood correctly.
Given:
>>> df1
Month Dec Nov
ID
XXX 4.0 1.0
YYY 8.0 3.0
ZZZ 4.0 1.0
>>> df2
Month Dec Nov Oct
ID
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
YYY 5.0 5.0 0.0
Solution:
>>> pd.concat([df1, df2]).reset_index().groupby('ID', sort=False).sum(min_count=1)
Dec Nov Oct
ID
XXX 4.0 1.0 NaN
YYY 13.0 8.0 0.0
ZZZ 4.0 1.0 NaN
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
Explanation:
Concatenation just puts df2 under df1.
>>> cat = pd.concat([df1, df2])
>>> cat
Dec Nov Oct
ID
XXX 4.0 1.0 NaN
YYY 8.0 3.0 NaN
ZZZ 4.0 1.0 NaN
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
YYY 5.0 5.0 0.0
reset_index moves the index into the columns.
>>> cat = cat.reset_index()
>>> cat
ID Dec Nov Oct
0 XXX 4.0 1.0 NaN
1 YYY 8.0 3.0 NaN
2 ZZZ 4.0 1.0 NaN
3 AAA 1.0 7.0 9.0
4 BBB 0.0 NaN 2.0
5 YYY 5.0 5.0 0.0
I did this so that there is a column named 'ID' by which I can group the other values. groupby('ID', sort=False) creates groups of rows sharing the same value in the 'ID' column, and sort=False keeps the groups in order of first appearance rather than sorting them, so the rows of the final result match your output.
We can check the group sizes like this:
>>> cat.groupby('ID', sort=False).size()
ID
XXX 1
YYY 2
ZZZ 1
AAA 1
BBB 1
dtype: int64
As you can see, we only have one group of size two, because the 'YYY' ID is the only duplicated one.
sum(min_count=1) works like this: the values in each group are summed up per column. The parameter min_count=1 requires at least one non-NaN value for a valid sum, so a column that is all NaN within a group sums to NaN instead of 0.0.
>>> cat.groupby('ID', sort=False).sum(min_count=1)
Dec Nov Oct
ID
XXX 4.0 1.0 NaN
YYY 13.0 8.0 0.0
ZZZ 4.0 1.0 NaN
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
Demo for min_count:
>>> s = pd.Series([np.nan, np.nan])
>>> s
0 NaN
1 NaN
dtype: float64
>>>
>>> s.sum()
0.0
>>> s.sum(min_count=1)
nan
>>> s[0] = 1
>>> s
0 1.0
1 NaN
dtype: float64
>>> s.sum()
1.0
>>> s.sum(min_count=1)
1.0
>>> s.sum(min_count=2)
nan
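Applied to your chunked reading, you would collect each chunk's pivoted frame in a list and run the concat/groupby once at the end. A sketch, reusing df1 and df2 from above in place of the real per-chunk frames:
>>> frames = [df1, df2]  # in practice: one pivoted frame per read_csv chunk
>>> result = pd.concat(frames).reset_index().groupby('ID', sort=False).sum(min_count=1)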
My approach would be to compute the sum first and then, at the indices where the result of the sum is NaN, insert the values that reside in only one DataFrame:
result = df1 + df2
Month Dec Nov Oct
ID
AAA NaN NaN NaN
BBB NaN NaN NaN
XXX NaN NaN NaN
YYY 13.0 8.0 NaN
ZZZ NaN NaN NaN
result = result.where(~result.isna(), df1)
Month Dec Nov Oct
ID
AAA NaN NaN NaN
BBB NaN NaN NaN
XXX 4.0 1.0 NaN
YYY 13.0 8.0 NaN
ZZZ 4.0 1.0 NaN
result = result.where(~result.isna(), df2)
Month Dec Nov Oct
ID
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
XXX 4.0 1.0 NaN
YYY 13.0 8.0 0.0
ZZZ 4.0 1.0 NaN
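As a side note, result.where(~result.isna(), df1) is equivalent to result.fillna(df1), since fillna also accepts a DataFrame, so the whole thing condenses to:
result = (df1 + df2).fillna(df1).fillna(df2)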