I have a dataframe:
A C D
0 one 0.410599 -0.205158
1 one 0.144044 0.313068
2 one 0.333674 -0.742165
3 three 0.761038 -2.552990
4 three 1.494079 2.269755
5 two 1.454274 -0.854096
6 two 0.121675 0.653619
7 two 0.443863 0.864436
Let's assume that A is the anchor column. I now want to display each group value only once, at the top:
A C D
0 one 0.410599 -0.205158
1 0.144044 0.313068
2 0.333674 -0.742165
3 three 0.761038 -2.552990
4 1.494079 2.269755
5 two 1.454274 -0.854096
6 0.121675 0.653619
7 0.443863 0.864436
This is what I've come up with:
df['A'] = df.groupby('A', as_index=False)['A']\
.apply(lambda x: x.str.replace('.*', '').set_value(0, x.values[0])).values
My strategy was to do a groupby and then set all values to an empty string other than the first. This doesn't seem to work, because I get:
ValueError: Length of values does not match length of index
Which means that the output I get is incorrect. Any ideas/suggestions/improvements welcome.
I should add that I am looking for a general solution that can single out the value at the top, bottom, or middle of each group, so I'd prefer an answer that supports all three. The example above only shows the top-of-group case.
Your method didn't work because of an index mismatch. When you group by 'A', each group keeps its original index labels. Since set_value(0) could not find the label 0 in every group, it added a new entry with that label instead of overwriting the first row. That's the reason why there was a length mismatch.
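The label-versus-position behaviour can be reproduced on a single group. This is only a small sketch: set_value is no longer available in recent pandas releases, so it uses .loc (label-based) and .iloc (position-based) instead, and the Series below is a made-up stand-in for the 'three' group after its values have been blanked.

```python
import pandas as pd

# Stand-in for the blanked 'three' group: it keeps its original labels 3 and 4.
group = pd.Series(["", ""], index=[3, 4])

# Label-based assignment: label 0 does not exist, so the Series is enlarged.
by_label = group.copy()
by_label.loc[0] = "three"

# Position-based assignment: overwrites the first row, length is unchanged.
by_position = group.copy()
by_position.iloc[0] = "three"

print(len(by_label), len(by_position))
```

The label-based version ends up one row longer than the group it came from, which is the kind of growth that produces the length mismatch when the results are assigned back to the column.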
Fix 1: reset_index(drop=True)
df['A'] = df.groupby('A')['A'].apply(lambda x: x.str.replace('.*', '')\
.reset_index(drop=True).set_value(0, x.values[0])).values
df
A C D
0 one 0.410599 -0.205158
1 0.144044 0.313068
2 0.333674 -0.742165
3 three 0.761038 -2.552990
4 1.494079 2.269755
5 two 1.454274 -0.854096
6 0.121675 0.653619
7 0.443863 0.864436
Fix 2: set_value
set_value has a 3rd parameter called takeable, which determines whether the index argument is treated as a label or as a position. It is False by default, but setting it to True worked for my case.
In addition to Zero's solutions, the solution for isolating values at the centre of their groups is as follows:
df.A = df.groupby('A')['A'].apply(lambda x: x.str.replace('.*', '')\
.set_value(len(x) // 2, x.values[0], True)).values
df
A C D
0 0.410599 -0.205158
1 one 0.144044 0.313068
2 0.333674 -0.742165
3 0.761038 -2.552990
4 three 1.494079 2.269755
5 1.454274 -0.854096
6 two 0.121675 0.653619
7 0.443863 0.864436
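One way to cover top, middle, and bottom without set_value is to compute each row's position within its group and keep the label only where that position matches a target. The sketch below is not taken from the answers above: the helper name show_label_once and its where parameter are hypothetical, and the DataFrame is rebuilt from the data in the question.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "A": ["one", "one", "one", "three", "three", "two", "two", "two"],
    "C": [0.410599, 0.144044, 0.333674, 0.761038,
          1.494079, 1.454274, 0.121675, 0.443863],
    "D": [-0.205158, 0.313068, -0.742165, -2.552990,
          2.269755, -0.854096, 0.653619, 0.864436],
})

def show_label_once(frame, col, where="top"):
    """Hypothetical helper: blank `col` except at one row per group."""
    grouped = frame.groupby(col)[col]
    pos = grouped.cumcount()          # 0-based position of each row in its group
    size = grouped.transform("size")  # group length, aligned with frame's index
    target = {"top": 0, "middle": size // 2, "bottom": size - 1}[where]
    # Keep the label where the row sits at the target position, else "".
    return frame[col].where(pos == target, "")

top = show_label_once(df, "A", "top")
middle = show_label_once(df, "A", "middle")
bottom = show_label_once(df, "A", "bottom")

print(df.assign(A=middle))
```

The middle position is taken as size // 2, the same choice as len(x) // 2 above, so a two-row group shows its label on the second row. Because the comparison is done on aligned Series rather than inside apply, the original index never has to be reset.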