Pandas: Concatenate dataframe and keep duplicate indices

Tags: python, concat, pandas

I have two dataframes that I would like to concatenate column-wise (axis=1) with an inner join. One of the dataframes has some duplicate indices, but the rows are not duplicates, and I don't want to lose the data in those rows:

import pandas as pd

df1 = pd.DataFrame([{'a':1,'b':2},{'a':1,'b':3},{'a':2,'b':4}],
                   columns = ['a','b']).set_index('a')

df2 = pd.DataFrame([{'a':1,'c':5},{'a':2,'c':6}],columns = ['a','c']).set_index('a')

>>> df1
   b
a   
1  2
1  3
2  4

>>> df2
   c
a   
1  5
2  6

The default concat behavior is to fill missing values with NaNs:

>>> pd.concat([df1,df2])
    b   c
a
1   2 NaN
1   3 NaN
2   4 NaN
1 NaN   5
2 NaN   6

I want to keep the duplicate indices from df1 and fill them with the corresponding (repeated) values from df2, but in pandas 0.13.1 an inner join on the columns produces an error. In more recent versions of pandas, concat does what I want:

>>> pd.concat([df1, df2], axis=1, join='inner')
   b  c
a      
1  2  5
1  3  5
2  4  6

What's the best way to achieve the result I want? Is there a groupby solution? Or maybe I shouldn't be using concat at all?

asked Jul 10 '14 19:07

andbeonetraveler

1 Answer

You can perform a merge, setting left_index=True and right_index=True so that the index of each frame is used as the join key:

In [4]: df1.merge(df2, left_index=True, right_index=True)
Out[4]:
   b  c
a      
1  2  5
1  3  5
2  4  6

[3 rows x 2 columns]
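
For a self-contained check, the same index-on-index inner join can also be written with `DataFrame.join`. This is only a minimal sketch that rebuilds the frames from the question; the variable names `merged` and `joined` are illustrative and not part of the original answer.

```python
import pandas as pd

# Rebuild the frames from the question; df1 has a duplicated index label (1).
df1 = pd.DataFrame({'a': [1, 1, 2], 'b': [2, 3, 4]}).set_index('a')
df2 = pd.DataFrame({'a': [1, 2], 'c': [5, 6]}).set_index('a')

# Inner join on both indexes: every df1 row labelled 1 is paired with df2's row 1.
merged = df1.merge(df2, left_index=True, right_index=True, how='inner')

# DataFrame.join aligns on the index by default, so it expresses the same join.
joined = df1.join(df2, how='inner')

print(merged)
print(joined)
```

Note that `merge` defaults to `how='inner'` while `join` defaults to `how='left'`, which is why `how='inner'` is passed explicitly to `join` here.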

Concat should have worked; it worked for me:

In [5]: pd.concat([df1,df2], join='inner', axis=1)
Out[5]:
   b  c
a      
1  2  5
1  3  5
2  4  6

[3 rows x 2 columns]
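
Since the asker reports an error on pandas 0.13.1 while the same call works here, how `concat` handles a non-unique index appears to depend on the pandas version. One hedged way to cope is to try `concat` and fall back to the index merge if it raises. The helper name `inner_concat_columns` is hypothetical, and the broad `except Exception` is an assumption, made because the question does not state the exact error type.

```python
import pandas as pd


def inner_concat_columns(left, right):
    """Column-wise inner join that tolerates duplicate index labels.

    Hypothetical helper: tries pd.concat first and falls back to an
    index-on-index merge if concat raises.
    """
    try:
        return pd.concat([left, right], axis=1, join='inner')
    except Exception:
        # Assumption: the exception type differs between pandas versions,
        # so a broad catch is used in this sketch.
        return left.merge(right, left_index=True, right_index=True, how='inner')


# Frames from the question; df1 has a duplicated index label (1).
df1 = pd.DataFrame({'a': [1, 1, 2], 'b': [2, 3, 4]}).set_index('a')
df2 = pd.DataFrame({'a': [1, 2], 'c': [5, 6]}).set_index('a')

result = inner_concat_columns(df1, df2)
print(result)
```

If the installed version is known to reject duplicate labels in `concat`, calling `merge` directly is simpler than wrapping it in a fallback.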

answered Oct 24 '22 01:10

EdChum
