
Pandas: Concatenate dataframe and keep duplicate indices

I have two dataframes that I would like to concatenate column-wise (axis=1) with an inner join. One of the dataframes has some duplicate indices, but the rows themselves are not duplicates, and I don't want to lose the data from them:

df1 = pd.DataFrame([{'a':1,'b':2},{'a':1,'b':3},{'a':2,'b':4}],
                   columns = ['a','b']).set_index('a')

df2 = pd.DataFrame([{'a':1,'c':5},{'a':2,'c':6}],columns = ['a','c']).set_index('a')

>>> df1
   b
a   
1  2
1  3
2  4

>>> df2
   c
a   
1  5
2  6

By default, concat stacks the rows (axis=0) with an outer join on the columns, so missing values are filled with NaN:

>>> pd.concat([df1,df2])
    b   c
a
1   2 NaN
1   3 NaN
2   4 NaN
1 NaN   5
2 NaN   6

I want to keep the duplicate indices from df1 and fill them with the repeated values from df2, but in pandas 0.13.1 an inner join along the columns (axis=1) raises an error. In more recent versions of pandas, concat does what I want:

>>> pd.concat([df1, df2], axis=1, join='inner')
   b  c
a      
1  2  5
1  3  5
2  4  6

What's the best way to achieve the result I want? Is there a groupby solution? Or maybe I shouldn't be using concat at all?

asked Jul 10 '14 at 19:07 by andbeonetraveler



1 Answer

You can perform a merge and set the parameters so that the index of both the left and the right dataframe is used as the join key:

In [4]:    
df1.merge(df2, left_index=True, right_index=True)
Out[4]:
   b  c
a      
1  2  5
1  3  5
2  4  6

[3 rows x 2 columns]
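
For reference, here is a minimal self-contained sketch of the same merge, rebuilding df1 and df2 from the question. The explicit how='inner' is only there for readability (it is the default for merge), and the df1.join(df2, how='inner') line is an alternative I am assuming you may find shorter; it is not something the question asked for.

```python
import pandas as pd

# Rebuild the two frames from the question; 'a' becomes the index.
df1 = pd.DataFrame({'a': [1, 1, 2], 'b': [2, 3, 4]}).set_index('a')
df2 = pd.DataFrame({'a': [1, 2], 'c': [5, 6]}).set_index('a')

# Inner join on the index of both frames. Each duplicated index
# value in df1 is matched against the single df2 row with that value.
merged = df1.merge(df2, left_index=True, right_index=True, how='inner')

# DataFrame.join aligns on the index by default, so this should
# give the same rows for this data.
joined = df1.join(df2, how='inner')

print(merged)
```

Both rows with index 1 in df1 are kept, and each one picks up the value of c from the matching row of df2.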

Concat should have worked; it worked for me:

In [5]:

pd.concat([df1,df2], join='inner', axis=1)
Out[5]:
   b  c
a      
1  2  5
1  3  5
2  4  6

[3 rows x 2 columns]
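
Since the question reports that concat raises on 0.13.1 while it works for me, the behaviour evidently depends on the pandas version. If your code has to run on more than one version, one possible approach is to try concat and fall back to merge. The sketch below assumes that; inner_join_on_index is a hypothetical helper name, not a pandas function, and the broad except is deliberate because I am not certain which exception type each version raises.

```python
import pandas as pd

def inner_join_on_index(left, right):
    """Hypothetical helper: inner join two frames on their index."""
    try:
        # Works on versions where concat accepts duplicate index values.
        return pd.concat([left, right], axis=1, join='inner')
    except Exception:
        # Assumption: the exception type varies between pandas versions,
        # so catch broadly and use the index-based merge instead.
        return left.merge(right, left_index=True, right_index=True,
                          how='inner')

df1 = pd.DataFrame({'a': [1, 1, 2], 'b': [2, 3, 4]}).set_index('a')
df2 = pd.DataFrame({'a': [1, 2], 'c': [5, 6]}).set_index('a')

result = inner_join_on_index(df1, df2)
print(result)
```

If you do not need concat specifically, calling merge directly is simpler and avoids the try/except.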
answered Oct 24 '22 at 01:10 by EdChum