I'm relatively new to Pandas. I ran into some behavior with pd.concat() that I didn't expect.
import pandas as pd
df1 = pd.DataFrame([], columns=['a', 'b', 'c']).set_index(['b', 'a'])
df2 = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c']).set_index(['a', 'b'])  # index order intentionally reversed relative to df1
pd.concat([df1, df2])
I would expect the result of the above to be:
     c
a b
1 2  3
but instead it is:
     c
b a       <---- note that b=1 and a=2 here
1 2  3
In other words, it appears that pd.concat() ignores the index level names while concatenating and only relabels the levels afterwards.
On the other hand, pd.concat() works as I would expect with column headers. The result of pd.concat([df1.reset_index(), df2.reset_index()]) is:
     a    b  c
0  1.0  2.0  3
as expected.
Is the behavior I observed with pd.concat() and indices expected?
I tried Googling around, but I haven't been able to find an example of someone running into an issue similar to this.
Thanks!
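It seems that during concat Pandas matches index levels by their position, not by their names, and labels the result with the level names of the first DataFrame.
So for df1 the MultiIndex is composed of columns 1 and 0 (numbering starts from 0), whereas for df2 and df3 (defined in the example that follows) it is composed of columns 0 and 1, regardless of their names.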
To confirm it, try a slightly wider example:
df1 = pd.DataFrame([], columns=['a', 'b', 'c']).set_index(['b', 'a'])
df2 = pd.DataFrame([[1, 2, 3]], columns=['aa', 'bb', 'c']).set_index(['aa', 'bb'])
df3 = pd.DataFrame([[10, 20, 30]], columns=['xx', 'yy', 'c']).set_index(['xx', 'yy'])
pd.concat([df1, df2, df3])
The result is:
        c
b  a
1  2    3
10 20  30
So as you can see, even if the source column names (for index columns only) are different, it makes no difference. Only their position among the index columns matters.
But if you change the third column name (of a regular column):
df3 = pd.DataFrame([[10, 20, 30]], columns=['xx', 'yy', 'cc']).set_index(['xx', 'yy'])
(c changed to cc), the result is different:
         c    cc
b  a
1  2   3.0   NaN
10 20  NaN  30.0
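Since regular columns are matched by name while index levels appear to be matched by position, one possible workaround is to put every frame's index levels into the same order before concatenating. The sketch below does this with DataFrame.reorder_levels. The helper name concat_by_level_name and the sample frames left and right are made up for illustration, and it assumes every frame has a MultiIndex with the same level names.

```python
import pandas as pd


def concat_by_level_name(frames, level_order):
    """Concatenate frames after putting their index levels in one common order.

    Hypothetical helper, not part of pandas. Assumes every frame has a
    MultiIndex whose level names are exactly those in `level_order`.
    """
    aligned = [frame.reorder_levels(level_order) for frame in frames]
    return pd.concat(aligned)


# Illustrative data: same level names, opposite level order.
left = pd.DataFrame([[4, 5, 6]], columns=['a', 'b', 'c']).set_index(['b', 'a'])
right = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c']).set_index(['a', 'b'])

result = concat_by_level_name([left, right], ['a', 'b'])
print(result)
```

Here the row from left ends up with a=4 and b=5, because its levels are swapped into (a, b) order before the frames are stacked. Another option, already shown in the question, is to call reset_index() on each frame, concatenate, and then call set_index(['a', 'b']) on the result.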