When to use "Do" function in dplyr

I've learned that the `do` function is used when you want to apply a function to each group.

For example, if I want to pull the top 2 rows from the "A", "C", and "I" categories of the variable `Index`, the following syntax can be used.

```
t <- mydata %>%
  filter(Index %in% c("A", "C", "I")) %>%
  group_by(Index) %>%
  do(head(., 2))
```

I understand that after grouping by `Index`, the `do` function is used to compute `head(., 2)` for each group.

However, on some occasions `do` is not used at all. For example, to compute the mean of the variable `Y2014` grouped by the variable `Index`, I thought that the following code should be used.

```
t <- mydata %>%
  group_by(Index) %>%
  do(summarise(Mean_2014 = mean(Y2014)))
```

However, the above syntax returns an error:

```
Error in mean(Y2014) : object 'Y2014' not found
```

But if I remove `do` from the syntax, it returns exactly what I wanted.

```
t <- mydata %>%
  group_by(Index) %>%
  summarise(Mean_2014 = mean(Y2014))
```

I'm really confused about the usage of the `do` function in dplyr. It seems inconsistent to me. When should I use `do` and when should I not? Why should I use `do` in the first case and not in the second?

asked Jan 10 '18 08:01

Daniel Cho

1 Answer

The comments under the question point out that in many cases dplyr or an associated package offers an alternative that avoids `do`, and the examples in the question are of that sort. However, to answer the question directly rather than via alternatives:

Differences between using do and not using it

Within the context of data frames, the key differences between using `do` and not using `do` are:

1. No automatic insertion of dot. The code within `do` will not have dot automatically inserted as the first argument. For example, instead of the `do(summarise(Mean_2014 = mean(Y2014)))` code in the question, one would have to write `do(summarise(., Mean_2014 = mean(Y2014)))` with an explicit dot. This is a consequence of `do`, rather than `summarise`, being the right-hand-side function of `%>%`. This is important to understand so that we insert dot when needed. If the only purpose were to avoid automatic insertion of dot into the first argument, we could alternatively use braces: `whatever %>% { myfun(arg1, arg2) }` would also not insert dot as the first argument of the `myfun` call.

2. Respecting `group_by`. Only functions specifically written to respect `group_by` will do so. There are two issues here. (1) Only such functions are run once for each group; `mutate`, `summarize` and `do` are examples (there are others too). (2) Even if the function is run once per group, there is the question of how dot is handled. We focus on two cases (not a complete list): (i) If `do` is not used, a dot that appears inside a function call within an argument expression refers to the entire input, ignoring `group_by`. Presumably this is because magrittr's dot substitution knows nothing about `group_by`. (ii) Within `do`, dot always refers to the rows of the current group. For example, compare the output of the two pipelines below: dot refers to 3 rows in the first, where `do` is used, and to all 6 rows in the second, where it is not. This is despite the fact that `summarize` respects `group_by` in that it runs once per group.
```
library(dplyr)

# BOD is a built-in dataset with 6 rows; add a grouping column with two groups of 3
BOD$g <- c(1, 1, 1, 2, 2, 2)

BOD %>% group_by(g) %>% do(summarize(., nr = nrow(.)))
## # A tibble: 2 x 2
## # Groups: g [2]
## g nr
## <dbl> <int>
## 1 1.00 3
## 2 2.00 3

BOD %>% group_by(g) %>% summarize(nr = nrow(.))
## # A tibble: 2 x 2
## g nr
## <dbl> <int>
## 1 1.00 6
## 2 2.00 6
```
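As a hedged aside, the group-aware counterpart of `nrow(.)` inside `summarize` is dplyr's `n()`. The sketch below reuses the `BOD` example, copied into a variable named `bod` (a name chosen here only for illustration), and computes both counts in a single call so the difference is visible side by side.

```r
library(dplyr)

# Work on a copy of the built-in BOD dataset (6 rows) so the original is untouched.
bod <- BOD
bod$g <- c(1, 1, 1, 2, 2, 2)

counts <- bod %>%
  group_by(g) %>%
  summarize(
    nr_dot   = nrow(.),  # dot is substituted by magrittr: the whole 6-row input
    nr_group = n()       # n() is evaluated by dplyr once per group
  )

print(counts)
```

Here `nr_dot` is the same for every group because dot is resolved before dplyr sees the call, while `nr_group` reflects the size of each group.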

See ?do for more information.
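The dot-insertion behaviour described in point 1 can also be checked without dplyr at all. The following is a minimal sketch using only magrittr; `show_args` is a hypothetical helper, not part of any package, that simply reports which value landed in which argument position.

```r
library(magrittr)

# Hypothetical helper: makes the argument positions visible.
show_args <- function(first, second = "(none)") {
  paste0("first=", first, ", second=", second)
}

# Plain call on the right-hand side: the left-hand side becomes the first argument.
implicit <- "lhs" %>% show_args("x")        # same as show_args("lhs", "x")

# Braces suppress the automatic insertion, so dot must be placed explicitly.
braced <- "lhs" %>% { show_args("x", .) }   # same as show_args("x", "lhs")

# A dot nested inside another call does not stop first-argument insertion.
nested <- "lhs" %>% show_args(toupper(.))   # same as show_args("lhs", toupper("lhs"))

print(c(implicit, braced, nested))
```

The third case mirrors `summarize(nr = nrow(.))`: the input is still inserted as the first argument, and the nested dot is replaced by that same whole input.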

Code from Question

Now we go through the code in the question. As `mydata` was never defined in the question, we use the first line of code below to define it so that the examples are concrete.

```
mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

mydata %>%
  filter(Index %in% c("A", "C", "I")) %>%
  group_by(Index) %>%
  do(head(., 2))
## # A tibble: 6 x 2
## # Groups: Index [3]
##   Index  Y2014
##   <fctr> <dbl>
## 1 A       1.00
## 2 A       1.00
## 3 C       1.00
## 4 C       1.00
## 5 I       1.00
## 6 I       1.00
```

The code above produces 2 rows for each of the 3 groups, giving 6 rows. Had we omitted `do`, `head` would disregard `group_by` and produce only two rows, because dot would be the entire 9 rows of input rather than one group at a time. (In this particular case dplyr provides its own alternative to `head` that avoids these problems, but for the sake of illustrating the general point we stick to the code in the question.)
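To make the comparison concrete, here is a small sketch that counts the rows returned with and without `do`. It also tries `slice_head()`, which is one grouped alternative to `head` available in dplyr 1.0.0 and later; whether that is the alternative the answer has in mind is an assumption. The variable names `with_do`, `without_do` and `with_slice` are illustrative. Note that `do` is marked as superseded in recent dplyr releases, although it still runs.

```r
library(dplyr)

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

# head() wrapped in do(): runs once per group, dot is the current group's rows.
with_do <- mydata %>%
  group_by(Index) %>%
  do(head(., 2))

# head() without do(): not group-aware, so it sees all 9 rows at once.
without_do <- mydata %>%
  group_by(Index) %>%
  head(2)

# slice_head() (dplyr >= 1.0.0, assumed installed) respects the grouping.
with_slice <- mydata %>%
  group_by(Index) %>%
  slice_head(n = 2)

print(c(nrow(with_do), nrow(without_do), nrow(with_slice)))
```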

The following code from the question generates an error because dot insertion is not done within `do`, so what ought to be the first argument of `summarise`, i.e. the data frame input, is missing:

```
mydata %>%
  group_by(Index) %>%
  do(summarise(Mean_2014 = mean(Y2014)))
## Error in mean(Y2014) : object 'Y2014' not found
```

If we remove the `do` from the above code, as in the last line of code in the question, then it works because dot insertion is performed. Alternatively, if we add the dot, as in `do(summarise(., Mean_2014 = mean(Y2014)))`, it also works, although `do` is superfluous in this case: `summarise` already respects `group_by`, so there is no need to wrap it in `do`.

```
mydata %>%
  group_by(Index) %>%
  summarise(Mean_2014 = mean(Y2014))
## # A tibble: 3 x 2
##   Index  Mean_2014
##   <fctr>     <dbl>
## 1 A           1.00
## 2 C           1.00
## 3 I           1.00
```
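As a quick check of the claim that adding the dot makes the `do` version work, the sketch below runs both forms on the same `mydata` definition and compares the results. The names `explicit_dot` and `no_do` are illustrative, and the comparison is only on the group labels and means, since the two results may differ in attributes such as grouping.

```r
library(dplyr)

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

# do() with an explicit dot: summarise() receives the current group's rows.
explicit_dot <- mydata %>%
  group_by(Index) %>%
  do(summarise(., Mean_2014 = mean(Y2014)))

# No do(): %>% inserts the grouped data as summarise()'s first argument.
no_do <- mydata %>%
  group_by(Index) %>%
  summarise(Mean_2014 = mean(Y2014))

# Both should give one mean per group.
print(all.equal(explicit_dot$Mean_2014, no_do$Mean_2014))
```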

answered Oct 12 '22 22:10

G. Grothendieck
