Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using groupby with expanding and a custom function

I have a dataframe that consists of truthIds and trackIds:

truthId = ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'C', 'B', 'A', 'A', 'C', 'C']
trackId = [1, 1, 2, 2, 3, 4, 5, 3, 2, 1, 5, 4, 6]
df1 = pd.DataFrame({'truthId': truthId, 'trackId': trackId})
    trackId truthId
0         1       A
1         1       A
2         2       B
3         2       B
4         3       C
5         4       C
6         5       A
7         3       C
8         2       B
9         1       A
10        5       A
11        4       C
12        6       C

I wish to add a column that calculates, for each unique truthId, the length of the set of unique tracksIds that have previously (i.e. from the top of the data to that row) been associated with it:

       truthId  trackId  unique_Ids
0        A        1           1
1        A        1           1
2        B        2           1
3        B        2           1
4        C        3           1
5        C        4           2
6        A        5           2
7        C        3           2
8        B        2           1
9        A        1           2
10       A        5           2
11       C        4           2
12       C        6           3

I am very close to accomplishing this. I can use:

df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})

Which produces the following output:

                trackId
truthId            
A       0       1.0
        1       1.0
        6       2.0
        9       2.0
        10      2.0
B       2       1.0
        3       1.0
        8       1.0
C       4       1.0
        5       2.0
        7       2.0
        11      2.0
        12      3.0

This is consistent with the documentation

However, it throws an error when I attempt to assign this output to a new column:

df['unique_Ids'] = df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})

I have used this workflow before and ideally the new column is put back into the original DateFrame with no issues (i.e. Split-Apply-Combine). How can I get it to work?

like image 307
Tara S Avatar asked Feb 06 '18 18:02

Tara S


1 Answers

You need reset_index

df['Your']=(df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})).reset_index(level=0,drop=True)
df
Out[1162]: 
    trackId truthId  Your
0         1       A   1.0
1         1       A   1.0
2         2       B   1.0
3         2       B   1.0
4         3       C   1.0
5         4       C   2.0
6         5       A   2.0
7         3       C   2.0
8         2       B   1.0
9         1       A   2.0
10        5       A   2.0
11        4       C   2.0
12        6       C   3.0
like image 66
BENY Avatar answered Sep 24 '22 11:09

BENY