Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

groupby and return all rows of first n groups

I have a pandas dataframe like below

>>> df.head()
       0      1       2          3   4           5                      6
0  35000  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000CE
1  35001  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000PE
2  35002  26000  OPTIDX      NIFTY  XX  1609425000      NIFTY20DEC10400CE
3  35003  26000  OPTIDX      NIFTY  XX  1609425000      NIFTY20DEC10400PE
4  35004  26009  OPTIDX  BANKNIFTY  XX  1499956200  BANKNIFTY1771321100CE

I want to group these in by column 5 in sorted order and return first n groups where n can be given as variable.

I did df.sort_values(5).groupby([5]) i get <pandas.core.groupby.DataFrameGroupBy object at 0x2afc8d0>

How do I get all rows in first 2 groups. In the sample df above group 1 would be 1499351400, group 2 would be 1499351400 group 3 would be 1609425000

Expected output: when groups required =2

       0      1       2          3   4           5                      6
0  35000  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000CE
1  35001  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000PE
4  35004  26009  OPTIDX  BANKNIFTY  XX  1499956200  BANKNIFTY1771321100CE

Update1:After trying @jezrael's

>>> k2=k1[k1.groupby(5).ngroup() < 2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/python/2.7/lib/python2.7/site-packages/pandas/core/groupby.py", line 529, in __getattr__
    (type(self).__name__, attr))
AttributeError: 'DataFrameGroupBy' object has no attribute 'ngroup'

Additional: Is it possible to do it without pandas (only python), I may not always find machines with pandas on them. Thanks

like image 297
pythonRcpp Avatar asked Oct 29 '22 05:10

pythonRcpp


2 Answers

If you can't use ngroup, just rank the elements with 'dense' and use that to index into df:

In [24]: df.loc[df[5].rank(method='dense') <= 2]
Out[24]: 
       0      1       2          3   4           5                      6
0  35000  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000CE
1  35001  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000PE
4  35004  26009  OPTIDX  BANKNIFTY  XX  1499956200  BANKNIFTY1771321100CE

This works because rank(method='dense') gives us the sorted rank for each number seen:

In [25]: df[5].rank(method='dense')
Out[25]: 
0    1.0
1    1.0
2    3.0
3    3.0
4    2.0
Name: 5, dtype: float64

(P.S. In an odd coincidence, I added both ngroup and method='dense', so this question makes me very happy. :-)

like image 114
DSM Avatar answered Nov 13 '22 23:11

DSM


Use ngroup (working in 0.20.2) with boolean indexing:

df = df.sort_values(5)

print (df.groupby(5).ngroup())
0    0
1    0
4    1
2    2
3    2
dtype: int64

df = df[df.groupby(5).ngroup() < 2]
print (df)
       0      1       2          3   4           5                      6
0  35000  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000CE
1  35001  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000PE
4  35004  26009  OPTIDX  BANKNIFTY  XX  1499956200  BANKNIFTY1771321100CE

For oldier version of pandas use a little hack - information are hidden in object grouper.group_info, so select first array by [0]:

df = df.sort_values(5)

print (df.groupby([5]).grouper.group_info)
(array([0, 0, 2, 2, 1], dtype=int64), array([0, 1, 2]), 3)

print (df.groupby([5]).grouper.group_info[0])
[0 0 2 2 1]

df = df[df.groupby([5]).grouper.group_info[0] < 2]
print (df)
       0      1       2          3   4           5                      6
0  35000  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000CE
1  35001  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000PE
4  35004  26009  OPTIDX  BANKNIFTY  XX  1499956200  BANKNIFTY1771321100CE

Alternative solution with factorize:

df = df.sort_values(5)
df = df[pd.factorize(df[5])[0] < 2]
print (df)
       0      1       2          3   4           5                      6
0  35000  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000CE
1  35001  26009  OPTIDX  BANKNIFTY  XX  1499351400  BANKNIFTY1770621000PE
4  35004  26009  OPTIDX  BANKNIFTY  XX  1499956200  BANKNIFTY1771321100CE
like image 35
jezrael Avatar answered Nov 14 '22 01:11

jezrael