Here's an example df:
df <- structure(list(x = 1:30, y = 101:130, g = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("x", "y", "g"), row.names = c(NA, -30L), class = "data.frame")
I would like to get the 10 lowest values of y for each group within the filtered data.
But
df2 <- df %>% filter(x>3) %>% group_by(g) %>% tail(y, n=10)
only returns the rows for the last group (C in this case):
Source: local data frame [10 x 3]
Groups: g

    x   y g
18 21 121 C
19 22 122 C
20 23 123 C
21 24 124 C
22 25 125 C
23 26 126 C
24 27 127 C
25 28 128 C
26 29 129 C
27 30 130 C
What am I doing wrong?
group_by() in the dplyr package groups a data frame by one or more columns. It does not compute anything by itself; aggregations such as mean, sum, count, maximum and minimum are then computed per group by later verbs such as summarise().
%>% is the forward pipe operator. It forwards a value, or the result of an expression, into the next function call, which makes it possible to chain commands. It is defined by the magrittr package (CRAN) and is heavily used by dplyr (CRAN).
group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed "by group". ungroup() removes grouping. The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions.
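As a rough illustration of how the pipe, filter() and group_by() fit together, here is a minimal sketch. The scores data frame and its column names are made up for this example and are not part of the question.

```r
library(dplyr)

# Hypothetical toy data, unrelated to the df in the question
scores <- data.frame(
  team  = c("A", "A", "B", "B", "B"),
  score = c(1, 5, 2, 8, 4)
)

result <- scores %>%
  filter(score > 1) %>%      # keep only rows where the condition is TRUE
  group_by(team) %>%         # verbs after this point work per team
  summarise(mean_score = mean(score), n = n()) %>%
  ungroup()                  # make sure no grouping is left on the result

print(result)
```

Each step receives the result of the previous one as its first argument, so the chain reads top to bottom in the order the operations are applied.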
In conclusion, dplyr is generally fast (considerably faster than base R or plyr for grouped operations), but data.table is somewhat faster, especially for very large datasets and a large number of groups. For datasets under a million rows operations on dplyr (or data.
You can use tail() inside do().
df2 <- df %>% filter(x>3) %>% group_by(g) %>% do(tail(., n=10))
The use of . is key for this to work. From the do help page: "You can use . to refer to the current group."
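To see why the do() wrapper matters, here is a small sketch that rebuilds a data frame with the same values as the df in the question and compares a plain tail() with the per-group version. It assumes that tail() simply takes the last rows of the whole table regardless of grouping, and it also shows slice_tail(), which is available in dplyr 1.0.0 and later, as a possible alternative to do().

```r
library(dplyr)

# Same values as the df in the question, built more compactly
df <- data.frame(
  x = 1:30,
  y = 101:130,
  g = factor(rep(c("A", "B", "C"), each = 10))
)

grouped <- df %>% filter(x > 3) %>% group_by(g)

# tail() is a base R function and is not group-aware:
# it returns the last 10 rows of the whole table
last_overall <- tail(grouped, n = 10)

# do() calls tail() once per group; . stands for the current group
last_per_group <- grouped %>% do(tail(., n = 10))

# Assumes dplyr >= 1.0.0: slice_tail() respects the grouping directly
last_per_group2 <- grouped %>% slice_tail(n = 10)

print(table(last_overall$g))
print(table(last_per_group$g))
```

Group A only has 7 rows left after the filter, so it contributes 7 rows rather than 10.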
Edit:
As @beginneR pointed out, I was focusing on how to use tail in groups with dplyr and missed the part of the question where the OP asked for the 10 lowest values of y. Doing this correctly requires adding arrange. With tail, this means arranging by descending order of y, so that the lowest values end up last.
df2 <- df %>% filter(x>3) %>% group_by(g) %>% arrange(desc(y)) %>% do(tail(., n=10))
Here are two other options:
df %>% filter(x>3) %>% group_by(g) %>% top_n(10, desc(y))
Here we make use of top_n but use desc(y), since we want the lowest y values instead of the largest ("top") y values.
df %>% filter(x>3) %>% group_by(g) %>% arrange(y) %>% filter(1:n() <= 10)
which is equivalent to
df %>% filter(x>3) %>% group_by(g) %>% arrange(y) %>% slice(1:10)
After the grouping, we sort by increasing y and then select the first 10 rows per group (or fewer if there are not 10 rows in a group).
Since there was some confusion about lowest and last values to be selected: this answer selects the lowest values, not the last entries.
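The difference between lowest and last can be sketched with slice_min() and slice_tail(), both of which assume dplyr 1.0.0 or later (where top_n() is superseded). The example rebuilds the data frame with the values from the question and reverses the row order so that lowest and last no longer coincide. It keeps 3 rows per group instead of 10 purely to keep the output short; shuffled and n_keep are names chosen for this illustration.

```r
library(dplyr)

# Same values as the df in the question, built more compactly
df <- data.frame(
  x = 1:30,
  y = 101:130,
  g = factor(rep(c("A", "B", "C"), each = 10))
)

# Reverse the row order so that "last rows" and "lowest y" differ
shuffled <- df[nrow(df):1, ]

n_keep <- 3  # illustrative; the question asks for 10

# Lowest y values per group, independent of row order
lowest <- shuffled %>%
  filter(x > 3) %>%
  group_by(g) %>%
  slice_min(y, n = n_keep, with_ties = FALSE) %>%
  ungroup()

# Last rows per group, which depends on row order
last <- shuffled %>%
  filter(x > 3) %>%
  group_by(g) %>%
  slice_tail(n = n_keep) %>%
  ungroup()

print(lowest)
print(last)
```

In this data the two results happen to contain the same rows, because reversing the order puts the lowest values last; lowest is sorted by increasing y while last keeps the reversed row order. Setting with_ties = FALSE caps the result at n_keep rows per group even when y values are tied.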