
Using dplyr with filter, group_by & tail?

Tags: r, dplyr

Here's an example df:

df <- structure(list(x = 1:30, y = 101:130, g = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("x", "y", "g"), row.names = c(NA, -30L), class = "data.frame")

I would like to get the 10 lowest values of y for each group within the filtered data.

But

df2 <- df %>% filter(x>3) %>% group_by(g) %>%  tail(y, n=10)

only returns the rows for the last group (C in this case):

Source: local data frame [10 x 3]
Groups: g

    x   y g
18 21 121 C
19 22 122 C
20 23 123 C
21 24 124 C
22 25 125 C
23 26 126 C
24 27 127 C
25 28 128 C
26 29 129 C
27 30 130 C

What am I doing wrong?

erc asked Jul 01 '14 14:07




2 Answers

You can use tail inside do.

df2 <- df %>% filter(x>3) %>% group_by(g) %>%  do(tail(., n=10))

The use of . is key for this to work. From the do help page: "You can use . to refer to the current group."
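To see the difference on the example data, here is a minimal sketch. It assumes dplyr is installed; df is rebuilt in a shorter but equivalent form, and n = 2 is used only to keep the result small. Note that do() is marked as superseded in recent dplyr releases, though it still works.

```r
library(dplyr)

# Same contents as the question's df, built more compactly
df <- data.frame(x = 1:30, y = 101:130,
                 g = factor(rep(c("A", "B", "C"), each = 10)))

# tail() is not group-aware: it takes the last rows of the whole table
ungrouped_tail <- df %>% filter(x > 3) %>% group_by(g) %>% tail(n = 2)

# do() calls tail() once per group; `.` stands for the current group's rows
grouped_tail <- df %>% filter(x > 3) %>% group_by(g) %>% do(tail(., n = 2))

table(ungrouped_tail$g)  # only group C has rows
table(grouped_tail$g)    # two rows in each of A, B and C
```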

Edit:

As @beginneR pointed out, I focused on how to use tail per group with dplyr and missed that the OP asked for the 10 lowest values of y. Doing this correctly requires adding arrange. With tail, that means sorting by y in descending order, so that the lowest values end up last in each group.

df2 <- df %>% filter(x>3) %>% group_by(g) %>%  arrange(desc(y)) %>% do(tail(., n=10))
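With the example data no group has more than 10 rows after filtering, so the call above returns every remaining row. The sketch below uses n = 3 instead, purely as an illustrative choice, to show that the lowest y values per group are the ones kept. It assumes dplyr is installed and rebuilds df compactly.

```r
library(dplyr)

# Same contents as the question's df, built more compactly
df <- data.frame(x = 1:30, y = 101:130,
                 g = factor(rep(c("A", "B", "C"), each = 10)))

# Sort by descending y, then keep the last 3 rows of each group,
# which are the 3 lowest y values in that group
lowest3 <- df %>%
  filter(x > 3) %>%
  group_by(g) %>%
  arrange(desc(y)) %>%
  do(tail(., n = 3)) %>%
  ungroup()

# Smallest y that survived in each group: 104 (A), 111 (B), 121 (C)
tapply(lowest3$y, lowest3$g, min)
```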
aosmith answered Oct 22 '22 18:10


Here are two other options:

df %>% filter(x>3) %>% group_by(g) %>% top_n(10, desc(y))

Here we make use of top_n but use desc(y) since we want the lowest y values instead of the largest ("top") y values.

df %>% filter(x>3) %>% group_by(g) %>% arrange(y) %>% filter(1:n() <= 10)

which is equivalent to

df %>% filter(x>3) %>% group_by(g) %>% arrange(y) %>% slice(1:10)

After grouping, we sort the rows by increasing y and then select the first 10 rows per group (or fewer if a group has fewer than 10 rows).

Since there was some confusion about lowest and last values to be selected: this answer selects the lowest values, not the last entries.
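As a quick check, the sketch below runs the arrange/slice approach with 3 rows per group instead of 10 (an illustrative choice, since no group here has more than 10 rows after filtering). It also compares the result with slice_min(), which is not part of the original answer and is assumed to be available (dplyr 1.0.0 or later).

```r
library(dplyr)

# Same contents as the question's df, built more compactly
df <- data.frame(x = 1:30, y = 101:130,
                 g = factor(rep(c("A", "B", "C"), each = 10)))

filtered <- df %>% filter(x > 3) %>% group_by(g)

# Sort by y, then keep the first 3 rows of each group
via_slice <- filtered %>% arrange(y) %>% slice(1:3) %>% ungroup()

# slice_min() combines the sort and the per-group selection in one step
via_slice_min <- filtered %>% slice_min(y, n = 3) %>% ungroup()

via_slice$y                                # 104 105 106 111 112 113 121 122 123
identical(via_slice$y, via_slice_min$y)    # TRUE
```

Unlike slice(1:3), slice_min() keeps tied values by default (with_ties = TRUE), so the two can differ when y has duplicates at the cutoff.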

talat answered Oct 22 '22 19:10