I've learned that the do function is used when you want to apply a function to each group. For example, if I want to pull the top 2 rows from the "A", "C", and "I" categories of the variable Index, the following syntax can be used.
t <- mydata %>% filter(Index %in% c("A", "C", "I")) %>% group_by(Index) %>% do(head(.,2))
I understand that after grouping by Index, the do function is used to compute head(., 2) for each group.
However, on some occasions do is not used at all. For example, to compute the mean of the variable Y2014 grouped by the variable Index, I thought the following code should be used.
t <- mydata %>% group_by(Index) %>% do(summarise(Mean_2014 = mean(Y2014)))
However, the above syntax returns an error:
Error in mean(Y2014) : object 'Y2014' not found
But if I remove do from the syntax, it returns exactly what I wanted.
t <- mydata %>% group_by(Index) %>% summarise(Mean_2014 = mean(Y2014))
I'm really confused about the usage of the do function in dplyr. It seems inconsistent to me. When should I use do, and when should I not? Why should I use do in the first case and not in the second?
The comments under the question point out that in many cases dplyr or its associated packages offer an alternative that avoids do, and the examples in the question are of that sort. However, to answer the question directly rather than via alternatives:
Within the context of data frames, the key differences between using do and not using do are:
No automatic insertion of dot. The code within do will not have dot automatically inserted as its first argument. For example, instead of the do(summarise(Mean_2014 = mean(Y2014))) code in the question, one would have to write do(summarise(., Mean_2014 = mean(Y2014))) with an explicit dot. This is a consequence of do, rather than summarize, being the function on the right-hand side of %>%: the pipe inserts dot as the first argument of do, not of the call nested inside it. This is important to understand so that we insert dot when needed. If the purpose were simply to avoid automatic insertion of dot into the first argument, we could alternatively use braces: whatever %>% { myfun(arg1, arg2) } would also not insert dot as the first argument of the myfun call.
Respecting group_by. There are two issues here.
(1) Only functions specifically written to respect group_by will be run once for each group. mutate, summarize and do are examples of functions that run once per group (there are others too).
(2) Even if the function is run once for each group, there is the question of how dot is handled. We focus on two cases (not a complete list): (i) if do is not used and dot appears in a function call nested within an argument expression, it refers to the entire input, ignoring group_by. Presumably this is a consequence of magrittr's dot substitution rules, since magrittr knows nothing about group_by. On the other hand, (ii) within do, dot always refers to the rows of the current group.
For example, compare the output of these two and note that dot refers to 3 rows in the first case, where do is used, and to all 6 rows in the second, where it is not. This is despite the fact that summarize respects group_by in that it runs once per group.
library(dplyr)
# BOD is a built-in data frame with 6 rows; add a grouping column g
BOD$g <- c(1, 1, 1, 2, 2, 2)
BOD %>% group_by(g) %>% do(summarize(., nr = nrow(.)))
## # A tibble: 2 x 2
## # Groups: g [2]
## g nr
## <dbl> <int>
## 1 1.00 3
## 2 2.00 3
BOD %>% group_by(g) %>% summarize(nr = nrow(.))
## # A tibble: 2 x 2
## g nr
## <dbl> <int>
## 1 1.00 6
## 2 2.00 6
See ?do for more information.
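The brace idiom mentioned under the first point can be checked with a small sketch. It only assumes that magrittr is installed; count_args is a hypothetical helper introduced purely for illustration, not part of any package.

```r
library(magrittr)  # provides %>%; dplyr re-exports the same operator

# Hypothetical helper (illustration only): reports how many arguments it received.
count_args <- function(...) length(list(...))

# Plain call on the right-hand side: the piped value is inserted as the first argument.
n_plain <- 10 %>% count_args(1, 2)

# Braces: nothing is inserted, so only the two explicit arguments arrive.
n_braced <- 10 %>% { count_args(1, 2) }

# Braces with an explicit dot: the piped value is passed because we asked for it.
n_explicit <- 10 %>% { count_args(., 1, 2) }

print(c(plain = n_plain, braced = n_braced, explicit = n_explicit))
```

The plain call receives three arguments and the braced call two, which is the same reason summarise inside do never sees the data frame unless the dot is written out.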
Now we go through the code in the question. As mydata was never defined in the question, we use the first line of code below to define it so that the examples are concrete.
mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)
mydata %>%
  filter(Index %in% c("A", "C", "I")) %>%
  group_by(Index) %>%
  do(head(., 2))
## # A tibble: 6 x 2
## # Groups: Index [3]
## Index Y2014
## <fctr> <dbl>
## 1 A 1.00
## 2 A 1.00
## 3 C 1.00
## 4 C 1.00
## 5 I 1.00
## 6 I 1.00
The code above produces 2 rows for each of the 3 groups, giving 6 rows. Had we omitted do, head would disregard group_by and produce only two rows, with dot being regarded as the entire 9 rows of input rather than one group at a time. (In this particular case dplyr provides its own alternative to head that avoids these problems, but for the sake of illustrating the general point we stick to the code in the question.)
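To make the contrast concrete, here is a minimal sketch that runs head without do and compares it with a group-aware dplyr verb. It redefines the same stand-in mydata so that it runs on its own. Using slice here is our assumption about which dplyr alternative is meant; the text above does not name it.

```r
library(dplyr)

# Same stand-in data as used elsewhere in this answer.
mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

# head() is not group-aware: without do() it returns the first 2 rows overall.
without_do <- mydata %>% group_by(Index) %>% head(2)

# slice() is a dplyr verb that respects group_by(): rows 1 and 2 of each group.
# (Assumed here to be the kind of alternative the text alludes to.)
with_slice <- mydata %>% group_by(Index) %>% slice(1:2)

nrow(without_do)  # 2
nrow(with_slice)  # 6
```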
The following code from the question generates an error because dot insertion is not done within do, so what ought to be the first argument of summarise, i.e. the data frame input, is missing:
mydata %>%
  group_by(Index) %>%
  do(summarise(Mean_2014 = mean(Y2014)))
## Error in mean(Y2014) : object 'Y2014' not found
If we remove do from the above code, as in the last line of code in the question, then it works since the dot insertion is performed. Alternatively, if we add the dot, do(summarise(., Mean_2014 = mean(Y2014))) would also work, although do is superfluous in this case: summarize already respects group_by, so there is no need to wrap it in do.
mydata %>%
  group_by(Index) %>%
  summarise(Mean_2014 = mean(Y2014))
## # A tibble: 3 x 2
## Index Mean_2014
## <fctr> <dbl>
## 1 A 1.00
## 2 C 1.00
## 3 I 1.00
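As a quick check that the two working variants agree, the sketch below runs both the do version with an explicit dot and the plain summarise version on the same stand-in data. The names via_do and direct are ours, chosen only for this illustration.

```r
library(dplyr)

# Same stand-in data as used elsewhere in this answer.
mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

# Variant 1: do() with the dot written explicitly as summarise()'s first argument.
via_do <- mydata %>%
  group_by(Index) %>%
  do(summarise(., Mean_2014 = mean(Y2014)))

# Variant 2: summarise() directly; the pipe supplies the grouped data frame.
direct <- mydata %>%
  group_by(Index) %>%
  summarise(Mean_2014 = mean(Y2014))

# Both give one row per group with the same means.
all.equal(via_do$Mean_2014, direct$Mean_2014)
```

Since both give the same per-group means, the shorter summarise form is the natural choice whenever the function being applied is already group-aware.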