Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter rows after groupby pandas

Tags:

python

pandas

I have a table in pandas:

import pandas as pd

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})

LeafID  count       pidx     pidy   score
0   1       10           10        20     10
1   1       20           10        20     10
2   2       30          300       400     10
3   1       40           10        20     22
4   3       80           30        15     22
5   3       10           40        20      3
6   1       20           20        12      4
7   6       50           10        43      5
8   3       30           20        54      9
9   5       10           45       112      0
10  1       70           20        23      1

I want to do a groupby and then filter the rows where occurrence of pidx is greater than 2.

That is, filter rows where pidx is 10 and 20.

I tried using df.groupby('pidx').count() but it didn't helped me. Also for those rows I have to do 0.4*count+0.6*score.

Desired output is:

LeafID    count       pidx     pidy    final_score
   1       10           10        20
   1       20           10        20
   1       40           10        20
   6       50           10        43
   1       20           20        12
   3       30           20        54
   1       70           20        23
like image 388
Shubham R Avatar asked Jan 24 '17 06:01

Shubham R


People also ask

How do you filter in Groupby?

GROUP BY enables you to use aggregate functions on groups of data returned from a query. FILTER is a modifier used on an aggregate function to limit the values used in an aggregation. All the columns in the select statement that aren't aggregated should be specified in a GROUP BY clause in the query.

What is the difference between aggregating transforming and filtering data?

If you want to get a single value for each group, use aggregate() (or one of its shortcuts). If you want to get a subset of the original rows, use filter() . And if you want to get a new value for each original row, use transpose() .

Does pandas Groupby keep order?

Groupby preserves the order of rows within each group.


3 Answers

This is a straightforward application of filter after doing a groupby. In the data you provided, a value of 20 for pidx only occurred twice so it was filtered out.

df.groupby('pidx').filter(lambda x: len(x) > 2)

   LeafID  count  pidx  pidy
0       1     10    10    20
1       1     20    10    20
3       1     40    10    20
7       6     50    10    43
like image 157
Ted Petrou Avatar answered Oct 21 '22 10:10

Ted Petrou


You can use value_counts with boolean indexing and isin:

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    30    54      9
9        5     10    45   112      0
10       1     70    20    23      1

s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5

Timings:

np.random.seed(123)
N = 1000000


L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'count':np.random.randint(1000, size=N)})
print (df)


print (df.groupby('pidx').filter(lambda x: len(x) > 120))

def jez(df):
    s = df.pidx.value_counts()
    return df[df.pidx.isin(s[s>120].index)]

print (jez(df))

In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop

In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop

In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop

In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop

In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop

For final_score you can use:

df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))
like image 28
jezrael Avatar answered Oct 21 '22 11:10

jezrael


pandas

df[df.groupby('pidx').pidx.transform('count') > 2]


   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5
like image 43
piRSquared Avatar answered Oct 21 '22 09:10

piRSquared