
row-wise operations, select helpers and the mutate function in dplyr

I will use the following data set to illustrate my questions:

my_df <- data.frame(
    a = 1:10,
    b = 10:1
)
colnames(my_df) <- c("a", "b")

Part 1

I use the mutate() function to create two new variables in my data set and I would like to compute the row means of the two new columns inside the same mutate() call. However, I would really like to be able to use the select() helpers such as starts_with(), ends_with() or contains().

My first try:

my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2,
        mean = rowMeans(select(ends_with("2")))
    )
Error in mutate_impl(.data, dots) : 
  Evaluation error: No tidyselect variables were registered.

I understand why there is an error - the select() function is not given any .data argument. So I change the code in...

... my second try by adding "." inside the select() function:

my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2,
        mean = rowMeans(select(., ends_with("2")))
    )
    a  b a_2 b_2 mean
1   1 10   1 100  NaN
2   2  9   4  81  NaN
3   3  8   9  64  NaN
4   4  7  16  49  NaN
5   5  6  25  36  NaN
6   6  5  36  25  NaN
7   7  4  49  16  NaN
8   8  3  64   9  NaN
9   9  2  81   4  NaN
10 10  1 100   1  NaN

The new problem after the second try is that the mean column does not contain the mean of a_2 and b_2 as expected, but contains only NaNs. After studying the code a bit, I understood why. The "." added inside select() refers to the original my_df data frame, which does not have the a_2 and b_2 columns. So select() returns no columns at all, and it makes sense that NaNs are produced: I am asking R to compute row means over zero columns.
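The explanation above can be checked in isolation. This is a minimal sketch, assuming dplyr is attached and my_df is defined as at the top of the question; the names empty_sel and row_means are introduced here for illustration only.

```r
library(dplyr)

my_df <- data.frame(a = 1:10, b = 10:1)

# No column of the original data frame ends with "2",
# so the selection keeps all 10 rows but has zero columns.
empty_sel <- select(my_df, ends_with("2"))
ncol(empty_sel)

# Averaging over zero columns amounts to 0 / 0 for every row, hence NaN.
row_means <- rowMeans(empty_sel)
all(is.nan(row_means))
```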

I then tried to use dplyr functions such as current_vars() to see if it would make a difference:

my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2,
        mean = rowMeans(select(current_vars(), ends_with("2")))
    )
Error in mutate_impl(.data, dots) : 
  Evaluation error: Variable context not set.

However, this is obviously NOT the way to use this function. The solution I found is simply to add a second mutate() call:

my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2
    ) %>%
    mutate(mean = rowMeans(select(., ends_with("2"))))
    a  b a_2 b_2 mean
1   1 10   1 100 50.5
2   2  9   4  81 42.5
3   3  8   9  64 36.5
4   4  7  16  49 32.5
5   5  6  25  36 30.5
6   6  5  36  25 30.5
7   7  4  49  16 32.5
8   8  3  64   9 36.5
9   9  2  81   4 42.5
10 10  1 100   1 50.5

Question 1: Is there any way to perform this task in the same mutate() call? Using a second mutate() function is not really an issue anyway; however, I am curious to know if there exists a way to refer to currently existing variables. The mutate() function allows for the usage of variables as soon as they are created inside the same mutate() call; however, this becomes problematic when functions are nested as shown in my example above.

Part 2

I also realize that using rowMeans() works in my solution; however, it is not really a dplyr-way of doing things especially because I need to use select() inside it. So, I decided to use the rowwise() and mean() functions instead. But once again, I would like to use one of the select() helpers for that and not have to list all variables in a c() function. I tried:

my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2
    ) %>%
    rowwise() %>%
    mutate(
        mean = mean(ends_with("2"))
    )
Error in mutate_impl(.data, dots) : 
  Evaluation error: No tidyselect variables were registered.

I suspect that the error in the code is due to the fact that ends_with() is not inside select(), but I am showing this to ask whether there is a way to list the variables I want without having to specify them individually.

Thank you for your time.

SavedByJESUS asked Jan 20 '18 06:01



1 Answer

A bit late, but here is a solution to problem 1, for reference.

If you had to do it without pipes, you would write:

tmp1 = mutate(my_df, a_2 = a^2, b_2 = b^2)
tmp2 = select(tmp1, ends_with("2"))
tmp3 = rowMeans(tmp2)
tmp4 = mutate(tmp1, m=tmp3)

Or, with fewer intermediate steps:

tmp1 = mutate(my_df, a_2 = a^2, b_2 = b^2)
tmp4 = mutate(tmp1, m=rowMeans(select(tmp1, ends_with("2"))) )

Note that computing tmp4 requires using tmp1 twice. So in the piped version you also need to reference . explicitly a second time (as usual, the first reference is implicit, as the first argument to mutate()):

my_df %>%
  mutate(a_2 = a^2, b_2 = b^2) %>%
  mutate(mean = rowMeans(select(., ends_with("2"))) )
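If a select() helper is not strictly required, a single mutate() call may be enough, because columns created earlier in the call are visible to later expressions by name. The following is only a sketch of that idea, not a replacement for the helper-based version: it names a_2 and b_2 explicitly, and single_call is a name made up for this example.

```r
library(dplyr)

my_df <- data.frame(a = 1:10, b = 10:1)

single_call <- my_df %>%
  mutate(
    a_2 = a^2,
    b_2 = b^2,
    # a_2 and b_2 already exist at this point of the same mutate() call;
    # cbind() builds a two-column matrix that rowMeans() can average.
    mean = rowMeans(cbind(a_2, b_2))
  )

single_call
```

The trade-off is that the columns have to be listed by hand, which is exactly what the question wanted to avoid for larger sets of variables.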

For problem #2: avoiding the call to rowMeans() is trickier, and may not even be desirable.
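For readers on a newer dplyr, one possible approach to problem #2 is c_across(), which accepts select() helpers inside a row-wise mutate(). This is a sketch that assumes dplyr 1.0.0 or later (c_across() did not exist when the question was asked); row_wise is a name introduced here for illustration.

```r
library(dplyr)

my_df <- data.frame(a = 1:10, b = 10:1)

row_wise <- my_df %>%
  mutate(a_2 = a^2, b_2 = b^2) %>%
  rowwise() %>%
  # c_across() gathers the selected columns of the current row into one vector.
  mutate(mean = mean(c_across(ends_with("2")))) %>%
  # Drop the row-wise grouping so later verbs behave as usual.
  ungroup()

row_wise
```

This still uses two mutate() calls, and on large data frames the vectorised rowMeans() version is likely to be faster than computing row by row.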

Pierre Gramme answered Nov 15 '22 03:11