Top N rows by group using python datatable

Tags:

python

r

py-datatable

What is the proper way to query the top N rows by group in Python datatable?
For example, to get the top 2 rows with the largest v3 values within each id2, id4 group, I would use the following pandas expression:

df.sort_values('v3', ascending=False).groupby(['id2','id4']).head(2)

in R using data.table:

DT[order(-v3), head(v3, 2L), by=.(id2, id4)]

or in R using dplyr:

DF %>% arrange(desc(v3)) %>% group_by(id2, id4) %>% filter(row_number() <= 2L)

Example data and expected output using pandas:

import datatable as dt
# Use a separate name for the frame so the datatable module is not shadowed
DT = dt.Frame(id2=[1, 2, 1, 2, 1, 2], id4=[1, 1, 1, 1, 1, 1], v3=[1, 3, 2, 3, 3, 3])
df = DT.to_pandas()
df.sort_values('v3', ascending=False).groupby(['id2','id4']).head(2)
#   id2  id4  v3
#1    2    1   3
#3    2    1   3
#4    1    1   3
#2    1    1   2

asked Jan 10 '19 12:01

jangorecki

1 Answer

Starting from datatable version 0.8.0, this can be achieved by combining grouping, sorting and filtering:

from datatable import Frame, by, f, sort
DT = Frame(id2=[1, 2, 1, 2, 1, 2],
           id4=[1, 1, 1, 1, 1, 1],
           v3=[1, 3, 2, 3, 3, 3])

DT[:2, :, by(f.id2, f.id4), sort(-f.v3)]

which produces

     id2  id4  v3
---  ---  ---  --
 0     1    1   3
 1     1    1   2
 2     2    1   3
 3     2    1   3

[4 rows x 3 columns]

Explanation:

- by(f.id2, f.id4) groups the data by columns "id2" and "id4";
- the sort(-f.v3) command tells datatable to sort the records by column "v3" in descending order. In the presence of by() this operator will be applied within each group;
- the first :2 selects the top 2 rows, again within each group;
- the second : selects all columns. If needed, this could have been a list of columns or expressions, allowing you to perform some operation(s) on the first 2 rows of each group.
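
To make the semantics concrete, here is a minimal pure-Python sketch of what the by/sort/slice combination computes: group the rows, sort each group by v3 in descending order, and keep the first 2 rows of each group. This is not how datatable implements it; the helper name top_n_by_group and the dict-per-row layout are assumptions made purely for illustration.

```python
from collections import defaultdict


def top_n_by_group(rows, group_keys, sort_key, n):
    """Hypothetical helper: keep the n rows with the largest sort_key per group."""
    # Step 1: group rows by the tuple of grouping-column values (like by(...)).
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row)

    result = []
    # Groups are emitted in ascending key order.
    for key in sorted(groups):
        # Step 2: sort within the group, descending (like sort(-f.v3)).
        ranked = sorted(groups[key], key=lambda r: r[sort_key], reverse=True)
        # Step 3: keep the first n rows of the group (like the leading :2).
        result.extend(ranked[:n])
    return result


# Same example data as in the question and answer.
rows = [
    {"id2": id2, "id4": id4, "v3": v3}
    for id2, id4, v3 in zip([1, 2, 1, 2, 1, 2],
                            [1, 1, 1, 1, 1, 1],
                            [1, 3, 2, 3, 3, 3])
]

top2 = top_n_by_group(rows, ["id2", "id4"], "v3", 2)
for row in top2:
    print(row["id2"], row["id4"], row["v3"])
```

On the example data this yields the same four rows as the datatable call above: (1, 1, 3), (1, 1, 2), (2, 1, 3), (2, 1, 3).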

answered Sep 30 '22 04:09

Pasha
