Filter rows after groupby pandas

Tags: python, pandas

I have a table in pandas:

import pandas as pd

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})

    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    20    54      9
9        5     10    45   112      0
10       1     70    20    23      1

I want to do a groupby and then keep only the rows whose pidx value occurs more than 2 times.

That is, keep the rows where pidx is 10 or 20.

I tried df.groupby('pidx').count(), but it didn't help. For the remaining rows I also need to compute final_score = 0.4*count + 0.6*score.

Desired output is:

LeafID    count       pidx     pidy    final_score
   1       10           10        20
   1       20           10        20
   1       40           10        20
   6       50           10        43
   1       20           20        12
   3       30           20        54
   1       70           20        23

asked Jan 24 '17 06:01

Shubham R

3 Answers

This is a straightforward application of filter after doing a groupby. In the data you provided, a value of 20 for pidx only occurred twice so it was filtered out.

df.groupby('pidx').filter(lambda x: len(x) > 2)

   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5
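
As a rough illustration of what filter is doing here, the sketch below first prints the size of each pidx group and then applies the same filter. The variable names group_sizes and kept are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Number of rows in each pidx group. In this data only pidx == 10
# appears more than twice; pidx == 20 appears twice.
group_sizes = df.groupby('pidx').size()
print(group_sizes)

# filter() calls the function once per group (g is that group's
# sub-DataFrame) and keeps the rows of groups where it returns True.
kept = df.groupby('pidx').filter(lambda g: len(g) > 2)
print(kept)
```

Checking group_sizes first is a quick way to see why a group you expected to survive was dropped.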

answered Oct 21 '22 10:10

Ted Petrou

You can use value_counts with boolean indexing and isin:

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    30    54      9
9        5     10    45   112      0
10       1     70    20    23      1

s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5
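
To make the intermediate steps of this approach visible, here is a small sketch that prints the counts and the boolean mask before indexing. The name mask is introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# value_counts() returns a Series indexed by the distinct pidx values,
# holding how many times each one occurs.
s = df['pidx'].value_counts()
print(s)

# Keep the pidx values that occur more than twice, then mark every
# row whose pidx is one of them.
idx = s[s > 2].index
mask = df['pidx'].isin(idx)
print(mask)

print(df[mask])
```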

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000

L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'count':np.random.randint(1000, size=N)})
print (df)


print (df.groupby('pidx').filter(lambda x: len(x) > 120))

def jez(df):
    s = df.pidx.value_counts()
    return df[df.pidx.isin(s[s>120].index)]

print (jez(df))

In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop

In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop

In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop

In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop

In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop
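
Before comparing speed, it can be worth confirming that the approaches select the same rows. The sketch below does that on a smaller random frame and then times each one with the standard timeit module. The function names, the reduced N, and the threshold of 100 are assumptions made for this example, and the timings it prints will differ from the figures above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100_000  # smaller than the benchmark above so the check runs quickly

df = pd.DataFrame({
    'pidx': np.random.randint(1000, size=N),
    'count': np.random.randint(1000, size=N),
})
threshold = 100  # hypothetical cutoff, close to the average group size


def with_filter(df):
    return df.groupby('pidx').filter(lambda g: len(g) > threshold)


def with_value_counts(df):
    s = df['pidx'].value_counts()
    return df[df['pidx'].isin(s[s > threshold].index)]


def with_transform(df):
    return df[df.groupby('pidx')['pidx'].transform('size') > threshold]


funcs = (with_filter, with_value_counts, with_transform)

# Sort by index before comparing so that row order cannot cause a mismatch.
results = [f(df).sort_index() for f in funcs]
all_equal = all(r.equals(results[0]) for r in results[1:])
print('same rows selected:', all_equal)

# Average wall-clock time over three calls of each function.
for f in funcs:
    seconds = timeit.timeit(lambda: f(df), number=3) / 3
    print(f'{f.__name__}: {seconds:.4f} s per call')
```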

For final_score you can use:

df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))
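
Putting the two steps together, a minimal sketch might filter first and then add final_score only to the surviving rows. Using assign here is a choice made for this example, not something the answer prescribes. Note that the column has to be accessed as df['count'], because df.count is the DataFrame method.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# Step 1: keep rows whose pidx value occurs more than twice.
s = df['pidx'].value_counts()
kept = df[df['pidx'].isin(s[s > 2].index)]

# Step 2: add the weighted score. assign() returns a new DataFrame,
# so neither df nor kept is modified in place.
result = kept.assign(
    final_score=lambda d: 0.4 * d['count'] + 0.6 * d['score']
)
print(result)
```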

answered Oct 21 '22 11:10

jezrael

With pandas, using groupby and transform:

df[df.groupby('pidx').pidx.transform('count') > 2]


   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5
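
This works because transform returns a Series aligned with the original index, so each row carries the size of its own group. The sketch below prints that intermediate Series. The name per_row_count is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'LeafID': [1, 1, 2, 1, 3, 3, 1, 6, 3, 5, 1],
    'pidx': [10, 10, 300, 10, 30, 40, 20, 10, 30, 45, 20],
    'pidy': [20, 20, 400, 20, 15, 20, 12, 43, 54, 112, 23],
    'count': [10, 20, 30, 40, 80, 10, 20, 50, 30, 10, 70],
    'score': [10, 10, 10, 22, 22, 3, 4, 5, 9, 0, 1],
})

# One value per original row: how many rows share that row's pidx.
per_row_count = df.groupby('pidx')['pidx'].transform('count')
print(per_row_count)

# Comparing against 2 gives a boolean mask with the same index as df.
result = df[per_row_count > 2]
print(result)
```

Note that 'count' ignores missing values while 'size' counts every row, so the two can differ if the counted column contains NaN.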

answered Oct 21 '22 09:10

piRSquared
