Merge 2 DataFrames in pandas: join on some columns, sum up others

Tags: python, pandas

I want to merge two dataframes on specific columns (key1, key2) and sum up the values for another column (value).

>>> import pandas as pd
>>> df1 = pd.DataFrame({'key1': range(4), 'key2': range(4), 'value': range(4)})
>>> df1
   key1  key2  value
0     0     0      0
1     1     1      1
2     2     2      2
3     3     3      3

>>> df2 = pd.DataFrame({'key1': range(2, 6), 'key2': range(2, 6), 'noise': range(2, 6), 'value': range(10, 14)})
>>> df2
   key1  key2  noise  value
0     2     2      2     10
1     3     3      3     11
2     4     4      4     12
3     5     5      5     13

I want this result:

   key1  key2  value
0     0     0      0
1     1     1      1
2     2     2     12
3     3     3     14
4     4     4     12
5     5     5     13

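In SQL terms, I want something like:

SELECT COALESCE(df1.key1, df2.key1) AS key1,
       COALESCE(df1.key2, df2.key2) AS key2,
       COALESCE(df1.value, 0) + COALESCE(df2.value, 0) AS value
FROM df1 FULL OUTER JOIN df2 ON df1.key1 = df2.key1 AND df1.key2 = df2.key2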

I tried two approaches:

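Approach 1: concatenate, then group and sum

concatenated = pd.concat([df1, df2])
grouped = concatenated.groupby(['key1', 'key2'], as_index=False)
# 'sum' as a string avoids the numpy import and the FutureWarning newer pandas raises for np.sum
summed = grouped.agg('sum')
result = summed[['key1', 'key2', 'value']]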

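Approach 2: outer merge, then add the two value columns

joined = pd.merge(df1, df2, how='outer', on=['key1', 'key2'], suffixes=('_1', '_2'))
# Keys present in only one frame get NaN in the other frame's columns; treat those as 0.
joined = joined.fillna(0.0)
# Note: 'value' ends up as float64 here because of the intermediate NaNs.
joined['value'] = joined['value_1'] + joined['value_2']
result = joined[['key1', 'key2', 'value']]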

Both approaches give the result I want, but I wonder if there is a simpler way.

asked May 16 '13 09:05 by Laurie


1 Answer

I don't know about simpler, but you can get a little more concise:

>>> pd.concat([df1, df2]).groupby(["key1", "key2"], as_index=False)["value"].sum()
   key1  key2  value
0     0     0      0
1     1     1      1
2     2     2     12
3     3     3     14
4     4     4     12
5     5     5     13

Depending on your tolerance for chained operations, you might still want to break this across multiple lines. Four tends to be close to my upper limit (here: concat, groupby, select, sum).
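
As a rough sketch of what that multi-line form might look like, here is the same chain wrapped in parentheses with one step per line. It rebuilds df1 and df2 from the question so it runs on its own; the layout is a matter of taste rather than a recommendation.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({"key1": range(4), "key2": range(4), "value": range(4)})
df2 = pd.DataFrame({
    "key1": range(2, 6),
    "key2": range(2, 6),
    "noise": range(2, 6),
    "value": range(10, 14),
})

result = (
    pd.concat([df1, df2])                                 # stack the rows of both frames
    .groupby(["key1", "key2"], as_index=False)["value"]   # group by key pair, keep only 'value'
    .sum()                                                # add up 'value' within each group
)
print(result)
```

Selecting "value" before summing means the unrelated noise column never enters the aggregation, so no trailing column selection is needed.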

answered Nov 15 '22 16:11 by DSM
