I have a pandas data frame looks like:
df = pd.DataFrame(data = {
'v1': ['a', 'a', 'c', 'b', 'd', 'c', 'd', 'c', 'f', 'e'],
'v2': ['b', 'b', 'd', 'a', 'c', 'e', 'c', 'd', 'g', 'c'],
'v3': range(0,10)})
v1 v2 v3
0 a b 0
1 a b 1
2 c d 2
3 a b 3
4 c d 4
5 c e 5
6 c d 6
7 c d 7
8 f g 8
9 c e 9
I want to get result looks like:
v1 v2 count
0 a b 2
1 c d 3
2 c e 2
3 f g 1
The first row "a b 2", comes from "a b", "a b", "b a". as "a b" appears continuously twice, count only once.
The second row "c d 3" comes from "c d", "d c", "d c", "c d", as "d c" appears continuously twice
The third row "c e 2" comes from "c e", "e c"
The last row "f g 1" comes from "f g"
My initial thought is to use groupby, but I don't know how to skip the continuous appearance.
Sort the first two columns, drop consecutive duplicates, and then count them:
df.iloc[:, :2] = np.sort(df.iloc[:, :2], axis=1)
m = ~df.iloc[:, :2].ne(df.iloc[:, :2].shift()).cumsum().duplicated()
df[m].groupby(['v1', 'v2'], as_index=False).count()
v1 v2 v3
0 a b 2
1 c d 3
2 c e 2
3 f g 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With