I have a pandas dataframe like below
>>> df.head()
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
2 35002 26000 OPTIDX NIFTY XX 1609425000 NIFTY20DEC10400CE
3 35003 26000 OPTIDX NIFTY XX 1609425000 NIFTY20DEC10400PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
I want to group these in by column 5 in sorted order and return first n groups where n can be given as variable.
I did df.sort_values(5).groupby([5])
i get <pandas.core.groupby.DataFrameGroupBy object at 0x2afc8d0>
How do I get all rows in first 2 groups. In the sample df above group 1 would be 1499351400, group 2 would be 1499351400 group 3 would be 1609425000
Expected output: when groups required =2
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
Update1:After trying @jezrael's
>>> k2=k1[k1.groupby(5).ngroup() < 2]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/python/2.7/lib/python2.7/site-packages/pandas/core/groupby.py", line 529, in __getattr__
(type(self).__name__, attr))
AttributeError: 'DataFrameGroupBy' object has no attribute 'ngroup'
Additional: Is it possible to do it without pandas (only python), I may not always find machines with pandas on them. Thanks
If you can't use ngroup
, just rank the elements with 'dense'
and use that to index into df:
In [24]: df.loc[df[5].rank(method='dense') <= 2]
Out[24]:
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
This works because rank(method='dense')
gives us the sorted rank for each number seen:
In [25]: df[5].rank(method='dense')
Out[25]:
0 1.0
1 1.0
2 3.0
3 3.0
4 2.0
Name: 5, dtype: float64
(P.S. In an odd coincidence, I added both ngroup
and method='dense'
, so this question makes me very happy. :-)
Use ngroup
(working in 0.20.2
) with boolean indexing
:
df = df.sort_values(5)
print (df.groupby(5).ngroup())
0 0
1 0
4 1
2 2
3 2
dtype: int64
df = df[df.groupby(5).ngroup() < 2]
print (df)
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
For oldier version of pandas use a little hack - information are hidden in object grouper.group_info
, so select first array by [0]
:
df = df.sort_values(5)
print (df.groupby([5]).grouper.group_info)
(array([0, 0, 2, 2, 1], dtype=int64), array([0, 1, 2]), 3)
print (df.groupby([5]).grouper.group_info[0])
[0 0 2 2 1]
df = df[df.groupby([5]).grouper.group_info[0] < 2]
print (df)
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
Alternative solution with factorize
:
df = df.sort_values(5)
df = df[pd.factorize(df[5])[0] < 2]
print (df)
0 1 2 3 4 5 6
0 35000 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000CE
1 35001 26009 OPTIDX BANKNIFTY XX 1499351400 BANKNIFTY1770621000PE
4 35004 26009 OPTIDX BANKNIFTY XX 1499956200 BANKNIFTY1771321100CE
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With