Say I have a dataframe my_df with duplicate column names, e.g.
foo bar foo hello
0   1   1   5
1   1   2   5
2   1   3   5
I would like to create another dataframe that averages the duplicates:
foo bar hello
0.5   1   5
1.5   1   5
2.5   1   5
How can I do this in Pandas?
So far I have managed to identify duplicates:
import collections

my_columns = my_df.columns
my_duplicates = [x for x, y in collections.Counter(my_columns).items() if y > 1]
But I don't know how to ask Pandas to average them.
You can groupby the column index and take the mean:
In [11]: df.groupby(level=0, axis=1).mean()
Out[11]:
   bar  foo  hello
0    1  0.5      5
1    1  1.5      5
2    1  2.5      5
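Note that passing axis=1 to groupby is deprecated in recent pandas releases, so the line above may warn or fail there. A transpose-based equivalent, sketched here against the same all-numeric frame as above (the variable names are just for illustration), is:

import pandas as pd

# Same data as above: two 'foo' columns that should be averaged.
df = pd.DataFrame([[0, 1, 1, 5], [1, 1, 2, 5], [2, 1, 3, 5]],
                  columns=['foo', 'bar', 'foo', 'hello'])

# Transpose, average duplicate (row) labels, then transpose back.
averaged = df.T.groupby(level=0).mean().T
print(averaged)
#    bar  foo  hello
# 0  1.0  0.5    5.0
# 1  1.0  1.5    5.0
# 2  1.0  2.5    5.0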
A somewhat trickier example is if there is a non-numeric column:
In [21]: df
Out[21]:
   foo  bar  foo hello
0    0    1    1     a
1    1    1    2     a
2    2    1    3     a
The above will raise: DataError: No numeric types to aggregate. Definitely not going to win any prizes for efficiency, but here's a generic method to handle this case:
In [22]: dupes = df.columns.get_duplicates()
In [23]: dupes
Out[23]: ['foo']
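(Aside: Index.get_duplicates() has been removed in later pandas versions; as far as I know the equivalent, assuming the same df, is:

dupes = df.columns[df.columns.duplicated()].unique().tolist()
dupes
# ['foo']
)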
In [24]: pd.DataFrame({d: df[d] for d in df.columns if d not in dupes})
Out[24]:
   bar hello
0    1     a
1    1     a
2    1     a
In [25]: pd.concat(df.xs(d, axis=1) for d in dupes).groupby(level=0, axis=1).mean()
Out[25]:
   foo
0  0.5
1  1.5
2  2.5
In [26]: pd.concat([Out[25], Out[24]], axis=1)
Out[26]:
   foo  bar hello
0  0.5    1     a
1  1.5    1     a
2  2.5    1     a
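Outside of IPython you obviously can't reuse Out[24] and Out[25], so for reference here is the same idea wrapped into one self-contained function. This is only a sketch, not a drop-in from any library: the name average_duplicate_columns is made up, and it uses df.columns.duplicated() instead of the removed get_duplicates().

import pandas as pd

def average_duplicate_columns(df):
    # Names that appear more than once (replacement for Index.get_duplicates()).
    dupes = df.columns[df.columns.duplicated()].unique()

    # Columns that are not duplicated, kept untouched (non-numeric ones included).
    uniques = df.loc[:, ~df.columns.isin(dupes)]

    # Average each set of duplicated columns; df.loc[:, d] returns a DataFrame
    # because d is a duplicated label.
    averaged = pd.concat({d: df.loc[:, d].mean(axis=1) for d in dupes}, axis=1)

    # Column order may differ from the original frame.
    return pd.concat([averaged, uniques], axis=1)

df = pd.DataFrame([[0, 1, 1, 'a'], [1, 1, 2, 'a'], [2, 1, 3, 'a']],
                  columns=['foo', 'bar', 'foo', 'hello'])
print(average_duplicate_columns(df))
#    foo  bar hello
# 0  0.5    1     a
# 1  1.5    1     a
# 2  2.5    1     a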
I think the thing to take away is to avoid duplicate column names... or perhaps that I don't know what I'm doing.