What is the proper way to query top N rows by group in python datatable?
For example, to get the top 2 rows with the largest v3 value within each (id2, id4) group, I would write the following pandas expression:
df.sort_values('v3', ascending=False).groupby(['id2','id4']).head(2)
In R, using data.table:
DT[order(-v3), head(v3, 2L), by=.(id2, id4)]
or in R, using dplyr:
DF %>% arrange(desc(v3)) %>% group_by(id2, id4) %>% filter(row_number() <= 2L)
Example data and expected output using pandas:
import datatable as dt
# Use a separate name for the frame so the datatable module is not shadowed
DT = dt.Frame(id2=[1, 2, 1, 2, 1, 2], id4=[1, 1, 1, 1, 1, 1], v3=[1, 3, 2, 3, 3, 3])
df = DT.to_pandas()
df.sort_values('v3', ascending=False).groupby(['id2','id4']).head(2)
#    id2  id4  v3
# 1    2    1   3
# 3    2    1   3
# 4    1    1   3
# 2    1    1   2
Starting from datatable version 0.8.0, this can be achieved by combining grouping, sorting and filtering:
from datatable import Frame, by, f, sort
DT = Frame(id2=[1, 2, 1, 2, 1, 2],
           id4=[1, 1, 1, 1, 1, 1],
           v3=[1, 3, 2, 3, 3, 3])
DT[:2, :, by(f.id2, f.id4), sort(-f.v3)]
which produces
     id2  id4  v3
---  ---  ---  --
  0    1    1   3
  1    1    1   2
  2    2    1   3
  3    2    1   3

[4 rows x 3 columns]
Explanation:
by(f.id2, f.id4) groups the data by columns "id2" and "id4";
sort(-f.v3) tells datatable to sort the records by column "v3" in descending order. In the presence of by(), this operator is applied within each group;
:2 selects the top 2 rows, again within each group;
: selects all columns. If needed, this could have been a list of columns or expressions, allowing you to perform some operation(s) on the first 2 rows of each group.
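To make the semantics of the group, sort and slice steps concrete, here is a minimal plain-Python sketch using only the standard library. It is not how datatable implements the query; the helper name top_n_by_group and the tuple layout (id2, id4, v3) are assumptions made for illustration.

```python
from collections import defaultdict

# Same example rows as in the question, as (id2, id4, v3) tuples.
rows = [(1, 1, 1), (2, 1, 3), (1, 1, 2), (2, 1, 3), (1, 1, 3), (2, 1, 3)]


def top_n_by_group(rows, n):
    """Return the n rows with the largest v3 within each (id2, id4) group.

    Hypothetical helper: mirrors by(f.id2, f.id4), sort(-f.v3) and the
    row slice :n, but is not part of datatable.
    """
    groups = defaultdict(list)
    for id2, id4, v3 in rows:
        # Grouping step: collect rows under their (id2, id4) key.
        groups[(id2, id4)].append((id2, id4, v3))

    result = []
    for key in sorted(groups):  # emit groups in ascending key order
        # Sorting step: order rows inside the group by v3, descending.
        ordered = sorted(groups[key], key=lambda r: r[2], reverse=True)
        # Slicing step: keep only the first n rows of this group.
        result.extend(ordered[:n])
    return result


top2 = top_n_by_group(rows, 2)
print(top2)
```

For the example data, the printed rows match the four rows of the datatable result shown earlier, in the same group order.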