I have a table in pandas:
import pandas as pd
df = pd.DataFrame({
'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
'pidx':[10,10,300,10,30,40,20,10,30,45,20],
'pidy':[20,20,400,20,15,20,12,43,54,112,23],
'count':[10,20,30,40,80,10,20,50,30,10,70],
'score':[10,10,10,22,22,3,4,5,9,0,1]
})
LeafID count pidx pidy score
0 1 10 10 20 10
1 1 20 10 20 10
2 2 30 300 400 10
3 1 40 10 20 22
4 3 80 30 15 22
5 3 10 40 20 3
6 1 20 20 12 4
7 6 50 10 43 5
8 3 30 20 54 9
9 5 10 45 112 0
10 1 70 20 23 1
I want to do a groupby
and then filter the rows where occurrence of pidx
is greater than 2.
That is, filter rows where pidx
is 10 and 20.
I tried using df.groupby('pidx').count()
but it didn't helped me. Also for those rows I have to do 0.4*count+0.6*score.
Desired output is:
LeafID count pidx pidy final_score
1 10 10 20
1 20 10 20
1 40 10 20
6 50 10 43
1 20 20 12
3 30 20 54
1 70 20 23
GROUP BY enables you to use aggregate functions on groups of data returned from a query. FILTER is a modifier used on an aggregate function to limit the values used in an aggregation. All the columns in the select statement that aren't aggregated should be specified in a GROUP BY clause in the query.
If you want to get a single value for each group, use aggregate() (or one of its shortcuts). If you want to get a subset of the original rows, use filter() . And if you want to get a new value for each original row, use transpose() .
Groupby preserves the order of rows within each group.
This is a straightforward application of filter after doing a groupby. In the data you provided, a value of 20 for pidx only occurred twice so it was filtered out.
df.groupby('pidx').filter(lambda x: len(x) > 2)
LeafID count pidx pidy
0 1 10 10 20
1 1 20 10 20
3 1 40 10 20
7 6 50 10 43
You can use value_counts
with boolean indexing
and isin
:
df = pd.DataFrame({
'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
'pidx':[10,10,300,10,30,40,20,10,30,45,20],
'pidy':[20,20,400,20,15,20,12,43,54,112,23],
'count':[10,20,30,40,80,10,20,50,30,10,70],
'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
LeafID count pidx pidy score
0 1 10 10 20 10
1 1 20 10 20 10
2 2 30 300 400 10
3 1 40 10 20 22
4 3 80 30 15 22
5 3 10 40 20 3
6 1 20 20 12 4
7 6 50 10 43 5
8 3 30 30 54 9
9 5 10 45 112 0
10 1 70 20 23 1
s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
LeafID count pidx pidy score
0 1 10 10 20 10
1 1 20 10 20 10
3 1 40 10 20 22
7 6 50 10 43 5
Timings:
np.random.seed(123)
N = 1000000
L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
'pidx': np.random.randint(10000, size=N),
'pidy': np.random.choice(L2, N),
'count':np.random.randint(1000, size=N)})
print (df)
print (df.groupby('pidx').filter(lambda x: len(x) > 120))
def jez(df):
s = df.pidx.value_counts()
return df[df.pidx.isin(s[s>120].index)]
print (jez(df))
In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop
In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop
In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop
In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop
In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop
For final_score
you can use:
df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))
pandas
df[df.groupby('pidx').pidx.transform('count') > 2]
LeafID count pidx pidy score
0 1 10 10 20 10
1 1 20 10 20 10
3 1 40 10 20 22
7 6 50 10 43 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With