Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas join/merge/concat two DataFrames and combine rows of identical key/index [duplicate]

I am attempting to combine two sets of data, but I can't figure out which method is most suitable (join, merge, concat, etc.) for this application, and the documentation doesn't have any examples that do what I need to do.

I have two sets of data, structured like so:

>>> A
Time     Voltage
1.0      5.1
2.0      5.5
3.0      5.3
4.0      5.4
5.0      5.0

>>> B
Time     Current
-1.0     0.5
0.0      0.6
1.0      0.3
2.0      0.4
3.0      0.7

I would like to combine the data columns and merge the 'Time' column together so that I get the following:

>>> AB
Time     Voltage     Current
-1.0                 0.5
0.0                  0.6
1.0      5.1         0.3
2.0      5.5         0.4
3.0      5.3         0.7
4.0      5.4            
5.0      5.0            

I've tried AB = merge_ordered(A, B, on='Time', how='outer'), and while it successfully combined the data, it output something akin to:

>>> AB
Time     Voltage     Current
-1.0                 0.5
0.0                  0.6
1.0      5.1            
1.0                  0.3
2.0      5.5            
2.0                  0.4
3.0      5.3            
3.0                  0.7
4.0      5.4            
5.0      5.0            

You'll note that it did not combine rows with shared 'Time' values.

I have also tried merging a la AB = A.merge(B, on='Time', how='outer'), but that outputs something combined, but not sorted, like so:

>>> AB
Time     Voltage     Current
-1.0                 0.5
0.0                  0.6
1.0      5.1            
2.0      5.5            
3.0      5.3         0.7
4.0      5.4            
5.0      5.0            
1.0                  0.3
2.0                  0.4

...it essentially skips some of the data in 'Current' and appends it to the bottom, but it does so inconsistently. And again, it does not merge the rows together.

I have also tried AB = pandas.concat(A, B, axis=1), but the result does not get merged. I simply get, well, the concatenation of the two DataFrames, like so:

>>> AB
Time     Voltage     Time     Current
1.0      5.1         -1.0     0.5
2.0      5.5         0.0      0.6
3.0      5.3         1.0      0.3
4.0      5.4         2.0      0.4
5.0      5.0         3.0      0.7

I've been scouring the documentation and here to try to figure out the exact differences between merge and join, but from what I gather they're pretty similar. Still, I haven't found anything that specifically answers the question of "how to merge rows that share an identical key/index". Can anyone enlighten me on how to do this? I only have a few days-worth of experience with Pandas!

like image 980
J. Day Avatar asked May 31 '17 21:05

J. Day


2 Answers

merge
merge combines on columns. By default it takes all commonly named columns. Otherwise, you can specify which columns to combine on. In this example, I chose, Time.

A.merge(B, 'outer', 'Time')

   Time  Voltage  Current
0   1.0      5.1      0.3
1   2.0      5.5      0.4
2   3.0      5.3      0.7
3   4.0      5.4      NaN
4   5.0      5.0      NaN
5  -1.0      NaN      0.5
6   0.0      NaN      0.6

join
join combines on index values unless you specify the left hand side's column instead. That is why I set the index for the right hand side and Specify a column for the left hand side Time.

A.join(B.set_index('Time'), 'Time', 'outer')

   Time  Voltage  Current
0   1.0      5.1      0.3
1   2.0      5.5      0.4
2   3.0      5.3      0.7
3   4.0      5.4      NaN
4   5.0      5.0      NaN
4  -1.0      NaN      0.5
4   0.0      NaN      0.6    ​

pd.concat
concat combines on index values... so I create a list comprehension in which I iterate over each dataframe I want to combine [A, B]. In the comprehension, each dataframe assumes the name d, hence the for d in [A, B]. axis=1 says to combine them side by side thus using the index as the joining feature.

pd.concat([d.set_index('Time') for d in [A, B]], axis=1).reset_index()

   Time  Voltage  Current
0  -1.0      NaN      0.5
1   0.0      NaN      0.6
2   1.0      5.1      0.3
3   2.0      5.5      0.4
4   3.0      5.3      0.7
5   4.0      5.4      NaN
6   5.0      5.0      NaN

combine_first

A.set_index('Time').combine_first(B.set_index('Time')).reset_index()

   Time  Current  Voltage
0  -1.0      0.5      NaN
1   0.0      0.6      NaN
2   1.0      0.3      5.1
3   2.0      0.4      5.5
4   3.0      0.7      5.3
5   4.0      NaN      5.4
6   5.0      NaN      5.0
like image 106
piRSquared Avatar answered Oct 05 '22 23:10

piRSquared


It should work properly if the Time column is of the same dtype in both DFs:

In [192]: A.merge(B, how='outer').sort_values('Time')
Out[192]:
   Time  Voltage  Current
5  -1.0      NaN      0.5
6   0.0      NaN      0.6
0   1.0      5.1      0.3
1   2.0      5.5      0.4
2   3.0      5.3      0.7
3   4.0      5.4      NaN
4   5.0      5.0      NaN

In [193]: A.dtypes
Out[193]:
Time       float64
Voltage    float64
dtype: object

In [194]: B.dtypes
Out[194]:
Time       float64
Current    float64
dtype: object

Reproducing your problem:

In [198]: A.merge(B.assign(Time=B.Time.astype(str)), how='outer').sort_values('Time')
Out[198]:
   Time  Voltage  Current
5  -1.0      NaN      0.5
6   0.0      NaN      0.6
0   1.0      5.1      NaN
7   1.0      NaN      0.3
1   2.0      5.5      NaN
8   2.0      NaN      0.4
2   3.0      5.3      NaN
9   3.0      NaN      0.7
3   4.0      5.4      NaN
4   5.0      5.0      NaN

In [199]: B.assign(Time=B.Time.astype(str)).dtypes
Out[199]:
Time        object   # <------ NOTE
Current    float64
dtype: object

Visually it's hard to distinguish:

In [200]: B.assign(Time=B.Time.astype(str))
Out[200]:
   Time  Current
0  -1.0      0.5
1   0.0      0.6
2   1.0      0.3
3   2.0      0.4
4   3.0      0.7

In [201]: B
Out[201]:
   Time  Current
0  -1.0      0.5
1   0.0      0.6
2   1.0      0.3
3   2.0      0.4
4   3.0      0.7
like image 27
MaxU - stop WAR against UA Avatar answered Oct 06 '22 01:10

MaxU - stop WAR against UA