If I wanted to sum over some variables in a data-frame using <code>dplyr</code>, I could do: <pre class="prettyprint"><code>> head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa > select(iris, starts_with('Petal')) %>% rowSums() [1] 1.6 1.6 1.5 1.7 1.6 2.1 1.7 1.7 1.6 1.6 1.7 1.8 1.5 1.2 1.4 1.9 1.7 1.7 2.0 1.8 1.9 1.9 1.2 2.2 2.1 1.8 2.0 1.7 1.6 1.8 1.8 1.9 1.6 1.6 1.7 1.4 [37] 1.5 1.5 1.5 1.7 1.6 1.6 1.5 2.2 2.3 1.7 1.8 1.6 1.7 1.6 6.1 6.0 6.4 5.3 6.1 5.8 6.3 4.3 5.9 5.3 4.5 5.7 5.0 6.1 4.9 5.8 6.0 5.1 6.0 5.0 6.6 5.3 [73] 6.4 5.9 5.6 5.8 6.2 6.7 6.0 4.5 4.9 4.7 5.1 6.7 6.0 6.1 6.2 5.7 5.4 5.3 5.6 6.0 5.2 4.3 5.5 5.4 5.5 5.6 4.1 5.4 8.5 7.0 8.0 7.4 8.0 8.7 6.2 8.1 [109] 7.6 8.6 7.1 7.2 7.6 7.0 7.5 7.6 7.3 8.9 9.2 6.5 8.0 6.9 8.7 6.7 7.8 7.8 6.6 6.7 7.7 7.4 8.0 8.4 7.8 6.6 7.0 8.4 8.0 7.3 6.6 7.5 8.0 7.4 7.0 8.2 [145] 8.2 7.5 6.9 7.2 7.7 6.9 </code></pre> That's fine, but I would have thought <code>rowwise</code> accomplishes the same thing, but it doesn't, <pre class="prettyprint"><code>> select(iris, starts_with('Petal')) %>% rowwise() %>% sum() [1] 743.6 </code></pre> What I particularly want to do is select a set of columns, and create a new variable each value of which is the maximum value of each row of the selected columns. For example, if I selected the "Petal" columns, by maximum values would be 1.4, 1.4, 1.3 and so on. I could do it like this: <pre class="prettyprint"><code>> select(iris, starts_with('Petal')) %>% apply(1, max) </code></pre> and that's fine. But I'm just curious as to why the <code>rowwise</code> approach doesn't work. I realize I am using <code>rowwise</code> incorrectly, I'm just not sure why it is wrong.

You can skip the use of <code>select</code> if you use <code>c_across</code> to select the variables you want to sum: <pre class="prettyprint"><code>iris %>% rowwise() %>% mutate(sum = sum(c_across(starts_with("Petal"))), .keep = "used") %>% ungroup() </code></pre> Output If you want keep all the columns in your data frame then remove the <code>.keep</code> argument. <pre class="prettyprint"><code> Petal.Length Petal.Width sum <dbl> <dbl> <dbl> 1 1.4 0.2 1.6 2 1.4 0.2 1.6 3 1.3 0.2 1.5 4 1.5 0.2 1.7 5 1.4 0.2 1.6 6 1.7 0.4 2.1 7 1.4 0.3 1.7 8 1.5 0.2 1.7 9 1.4 0.2 1.6 10 1.5 0.1 1.6 # ... with 140 more rows </code></pre> <hr> Similarly, with <code>max</code>: <pre class="prettyprint"><code>iris %>% rowwise() %>% mutate(max = max(c_across(starts_with("Petal"))), .keep = "used") %>% ungroup() </code></pre>

In short: you are expecting the "sum" function to be aware of <code>dplyr</code> data structures like a data frame grouped by row. <code>sum</code> is not aware of it so it just takes the sum of the whole <code>data.frame</code>. Here is a brief explanation. This: <pre class="prettyprint"><code>select(iris, starts_with('Petal')) %>% rowwise() %>% sum() </code></pre> Can be rewritten without using the pipe operator as the following: <pre class="prettyprint"><code>data <- select(iris, starts_with('Petal')) data <- rowwise(data) sum(data) </code></pre> As you can see you were constructing something called a <code>tibble</code>. Then the <code>rowwise</code> call adds additional information on this object and specifies that it should be grouped row-wise. However only the functions aware of this grouping like <code>summarize</code> and <code>mutate</code> can work like intended. Base R functions like <code>sum</code> are not aware of these objects and treat them as any standard <code>data.frame</code>s. And the standard approach for <code>sum()</code> is to sum the entire data frame. Using <code>mutate</code> works: <pre class="prettyprint"><code>select(iris, starts_with('Petal')) %>% rowwise() %>% mutate(sum = sum(Petal.Width, Petal.Length)) </code></pre> Result: <pre class="prettyprint"><code>Source: local data frame [150 x 3] Groups: <by row> # A tibble: 150 x 3 Petal.Length Petal.Width sum <dbl> <dbl> <dbl> 1 1.40 0.200 1.60 2 1.40 0.200 1.60 3 1.30 0.200 1.50 ... </code></pre>

The problem is that the entire data frame is passed as dot despite the <code>rowwise</code>. To handle this use <code>do</code> which will interpret dot as meaning just the current row. One further problem is that the dot within <code>do</code> will represent the row as a list so convert it appropriately. <pre class="prettyprint"><code>library(dplyr) iris %>% slice(1:6) %>% select(starts_with('Petal')) %>% rowwise() %>% do( (.) %>% as.data.frame %>% mutate(sum = sum(.)) ) %>% ungroup </code></pre> giving: <pre class="prettyprint"><code># A tibble: 6 x 3 Petal.Length Petal.Width sum * <dbl> <dbl> <dbl> 1 1.40 0.200 1.60 2 1.40 0.200 1.60 3 1.30 0.200 1.50 4 1.50 0.200 1.70 5 1.40 0.200 1.60 6 1.70 0.400 2.10 </code></pre> <h3>dplyr 1.0 - added later</h3> Since this was asked dplyr 1.0 was released and it has <code>cur_data()</code> which can be used to simplify the above eliminating the need for <code>do</code>. <code>cur_data()</code> within a <code>rowwise</code> block refers only to the current row. <pre class="prettyprint"><code>iris %>% slice(1:6) %>% select(starts_with('Petal')) %>% rowwise() %>% mutate(sum = sum(cur_data())) %>% ungroup </code></pre>

dplyr rowwise sum and other functions like max

Tags:

dataframe

r

dplyr

rowwise

If I wanted to sum over some variables in a data-frame using dplyr, I could do:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> select(iris, starts_with('Petal')) %>% rowSums()
  [1] 1.6 1.6 1.5 1.7 1.6 2.1 1.7 1.7 1.6 1.6 1.7 1.8 1.5 1.2 1.4 1.9 1.7 1.7 2.0 1.8 1.9 1.9 1.2 2.2 2.1 1.8 2.0 1.7 1.6 1.8 1.8 1.9 1.6 1.6 1.7 1.4
 [37] 1.5 1.5 1.5 1.7 1.6 1.6 1.5 2.2 2.3 1.7 1.8 1.6 1.7 1.6 6.1 6.0 6.4 5.3 6.1 5.8 6.3 4.3 5.9 5.3 4.5 5.7 5.0 6.1 4.9 5.8 6.0 5.1 6.0 5.0 6.6 5.3
 [73] 6.4 5.9 5.6 5.8 6.2 6.7 6.0 4.5 4.9 4.7 5.1 6.7 6.0 6.1 6.2 5.7 5.4 5.3 5.6 6.0 5.2 4.3 5.5 5.4 5.5 5.6 4.1 5.4 8.5 7.0 8.0 7.4 8.0 8.7 6.2 8.1
[109] 7.6 8.6 7.1 7.2 7.6 7.0 7.5 7.6 7.3 8.9 9.2 6.5 8.0 6.9 8.7 6.7 7.8 7.8 6.6 6.7 7.7 7.4 8.0 8.4 7.8 6.6 7.0 8.4 8.0 7.3 6.6 7.5 8.0 7.4 7.0 8.2
[145] 8.2 7.5 6.9 7.2 7.7 6.9

That's fine, but I would have thought rowwise accomplishes the same thing, but it doesn't,

> select(iris, starts_with('Petal')) %>% rowwise() %>% sum()
[1] 743.6

What I particularly want to do is select a set of columns, and create a new variable each value of which is the maximum value of each row of the selected columns. For example, if I selected the "Petal" columns, by maximum values would be 1.4, 1.4, 1.3 and so on.

I could do it like this:

> select(iris, starts_with('Petal')) %>% apply(1, max)

and that's fine. But I'm just curious as to why the rowwise approach doesn't work. I realize I am using rowwise incorrectly, I'm just not sure why it is wrong.

979

asked Mar 21 '18 00:03

mjandrews

3 Answers

You can skip the use of select if you use c_across to select the variables you want to sum:

iris %>% 
  rowwise() %>% 
  mutate(sum = sum(c_across(starts_with("Petal"))), .keep = "used") %>% 
  ungroup()

Output

If you want keep all the columns in your data frame then remove the .keep argument.

 Petal.Length Petal.Width   sum
          <dbl>       <dbl> <dbl>
 1          1.4         0.2   1.6
 2          1.4         0.2   1.6
 3          1.3         0.2   1.5
 4          1.5         0.2   1.7
 5          1.4         0.2   1.6
 6          1.7         0.4   2.1
 7          1.4         0.3   1.7
 8          1.5         0.2   1.7
 9          1.4         0.2   1.6
10          1.5         0.1   1.6
# ... with 140 more rows

Similarly, with max:

iris %>% 
    rowwise() %>% 
    mutate(max = max(c_across(starts_with("Petal"))), .keep = "used") %>% 
    ungroup()

answered Oct 04 '22 07:10

LMc

In short: you are expecting the "sum" function to be aware of dplyr data structures like a data frame grouped by row. sum is not aware of it so it just takes the sum of the whole data.frame.

Here is a brief explanation. This:

select(iris, starts_with('Petal')) %>% rowwise() %>% sum()

Can be rewritten without using the pipe operator as the following:

data <- select(iris, starts_with('Petal'))
data <- rowwise(data)
sum(data)

As you can see you were constructing something called a tibble. Then the rowwise call adds additional information on this object and specifies that it should be grouped row-wise.

However only the functions aware of this grouping like summarize and mutate can work like intended. Base R functions like sum are not aware of these objects and treat them as any standard data.frames. And the standard approach for sum() is to sum the entire data frame.

Using mutate works:

select(iris, starts_with('Petal')) %>%
  rowwise() %>%
  mutate(sum = sum(Petal.Width, Petal.Length))

Result:

Source: local data frame [150 x 3]
Groups: <by row>

# A tibble: 150 x 3
   Petal.Length Petal.Width   sum
          <dbl>       <dbl> <dbl>
 1         1.40       0.200  1.60
 2         1.40       0.200  1.60
 3         1.30       0.200  1.50
 ...

answered Oct 04 '22 05:10

Karolis Koncevičius

The problem is that the entire data frame is passed as dot despite the rowwise. To handle this use do which will interpret dot as meaning just the current row. One further problem is that the dot within do will represent the row as a list so convert it appropriately.

library(dplyr)

iris %>%
  slice(1:6) %>%
  select(starts_with('Petal')) %>% 
  rowwise() %>%
  do( (.) %>% as.data.frame %>% mutate(sum = sum(.)) ) %>%
  ungroup

giving:

# A tibble: 6 x 3
  Petal.Length Petal.Width   sum
*        <dbl>       <dbl> <dbl>
1         1.40       0.200  1.60
2         1.40       0.200  1.60
3         1.30       0.200  1.50
4         1.50       0.200  1.70
5         1.40       0.200  1.60
6         1.70       0.400  2.10

dplyr 1.0 - added later

Since this was asked dplyr 1.0 was released and it has cur_data() which can be used to simplify the above eliminating the need for do. cur_data() within a rowwise block refers only to the current row.

iris %>%
  slice(1:6) %>%
  select(starts_with('Petal')) %>% 
  rowwise() %>%
  mutate(sum = sum(cur_data())) %>%
  ungroup

answered Oct 04 '22 06:10

G. Grothendieck

Related questions
                            
                                Seasonal decompose of monthly data including NA in r
                            
                                R: How to overload the `-` operator if you want use `-` in your 'overload' code?
                            
                                function(x) in R: writing a "function" without defining a function?
                            
                                Adding a legend to an rgl 3d plot
                            
                                Efficient linked list (ordered set) in R
                            
                                R stargazer, lme4 and lmerTest incompatibility
                            
                                Rmarkdown removes citation hyperlink
                            
                                Fill different colors for each quantile in geom_density() of ggplot [duplicate]
                            
                                Breaking out of nested loops in R
                            
                                Disabling buttons in Shiny
                            
                                R: What do you call the :: and ::: operators and how do they differ?
                            
                                Apply a dataframe of boolean values onto another dataframe in R
                            
                                R script in Power BI returns date as Microsoft.OleDb.Date
                            
                                using purrr to affect single columns of each dataframe in a list
                            
                                how to change the color in geom_point or lines in ggplot [duplicate]
                            
                                How to use the Google satellite view as tile in leaflet with R
                            
                                Saving ggplot graph to PDF with fonts embedded in r
                            
                                Rscript detect if R script is being called/sourced from another script
                            
                                Merge and Perfectly Align Histogram and Boxplot using ggplot2
                            
                                How to download entire repository from Github using R?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With