
Pandas concat appears to ignore indices

Tags: python, pandas

I'm relatively new to Pandas, and I ran into unexpected behavior with pd.concat().

import pandas as pd

df1 = pd.DataFrame([], columns=['a', 'b', 'c']).set_index(['b', 'a'])
df2 = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c']).set_index(['a', 'b'])  # index levels intentionally reversed
pd.concat([df1, df2])

I would expect the result of the above to be:

     c
a b
1 2  3

but instead it is:

     c
b a <---- note that b=1 and a=2 here
1 2  3

In other words, it appears that pd.concat() ignores the index level names while concatenating, and then labels the resulting levels with the names from the first DataFrame.

On the other hand, pd.concat() works as I would expect with column headers. The result of pd.concat([df1.reset_index(), df2.reset_index()]) is:

     a    b  c
0  1.0  2.0  3

as expected.

Is the behavior I observed with pd.concat() and indices expected?

I tried Googling around, but I haven't been able to find an example of someone running into an issue similar to this.

Thanks!

bacchuswng asked Dec 07 '25 08:12


1 Answer

It seems that during concat, Pandas:

  • Takes index column names from the first DataFrame only.
  • For the remaining DataFrames, matches index columns by position only, not by name.

So in df1 the MultiIndex is composed of columns 1 and 0 (numbering starts from 0), whereas in df2 it is composed of columns 0 and 1, regardless of their names.

To confirm it, try a bit wider example:

df1 = pd.DataFrame([], columns=['a', 'b', 'c']).set_index(['b', 'a'])
df2 = pd.DataFrame([[1, 2, 3]], columns=['aa', 'bb', 'c']).set_index(['aa', 'bb'])
df3 = pd.DataFrame([[10, 20, 30]], columns=['xx', 'yy', 'c']).set_index(['xx', 'yy'])
pd.concat([df1, df2, df3])

The result is:

        c
b  a     
1  2    3
10 20  30

So, as you can see, it makes no difference that the source column names (for index columns only) are different. Only their position among the columns matters.
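One way to check this without relying on how the result is printed is to read the index levels back by position. The sketch below reuses df1 and df2 from the question. Which level names end up on the result may depend on the pandas version, so it only looks at the values:

```python
import pandas as pd

# Frames from the question: df1 is empty with levels (b, a), df2 has levels (a, b).
df1 = pd.DataFrame([], columns=['a', 'b', 'c']).set_index(['b', 'a'])
df2 = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c']).set_index(['a', 'b'])

result = pd.concat([df1, df2])

# Read the levels by position rather than by name.
first_level = result.index.get_level_values(0).tolist()
second_level = result.index.get_level_values(1).tolist()
print(first_level, second_level)
```

If the levels are matched by position, the value that was a=1 in df2 should show up in the first level and b=2 in the second, whatever those levels are called in the result.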

But if you change the third column name (of a regular column):

df3 = pd.DataFrame([[10, 20, 30]], columns=['xx', 'yy', 'cc']).set_index(['xx', 'yy'])

(c changed to cc), the result is different:

         c    cc
b  a            
1  2   3.0   NaN
10 20  NaN  30.0
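
If the goal is to combine the frames by level name rather than by level position, one option is to put the index levels into a common order before concatenating. This is only a minimal sketch, assuming every frame has the same set of level names; the helper name concat_by_level_name is made up for this illustration and is not part of pandas:

```python
import pandas as pd

def concat_by_level_name(frames):
    """Reorder each frame's index levels to match the first frame, then concat.

    Hypothetical helper, not a pandas API. Assumes all frames share the same
    index level names, possibly in a different order.
    """
    order = list(frames[0].index.names)
    aligned = [frame.reorder_levels(order) for frame in frames]
    return pd.concat(aligned)

df1 = pd.DataFrame([], columns=['a', 'b', 'c']).set_index(['b', 'a'])
df2 = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c']).set_index(['a', 'b'])

result = concat_by_level_name([df1, df2])

# Move the levels back to regular columns so values can be checked by name.
row = result.reset_index()
print(row[['a', 'b', 'c']])
```

Because df2's levels are reordered to (b, a) before the concat, positional matching and name matching agree, so a and b keep their original values.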

Valdi_Bo answered Dec 09 '25 22:12


