Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

mutate_each / summarise_each in dplyr: how do I select certain columns and give new names to mutated columns?

Tags:

dataframe

r

dplyr

I'm a bit confused about the dplyr verb mutate_each.

It's pretty straightforward to use the basic mutate to transform a column of data into, say, z-scores, and create a new column in your data.frame (here with the name z_score_data):

newDF <- DF %>%   select(one_column) %>%   mutate(z_score_data = one_column - (mean(one_column) / sd(one_column)) 

However, since I have many columns of data I'd like to transform, it appears I should probably use the mutate_each verb.

newDF <- DF %>%      mutate_each(funs(scale)) 

So far so good. But as of yet I haven't been able to figure out:

  1. How can I give these new columns appropriate names, like I can in mutate?
  2. How can I select certain columns that I wish to mutate, like I did with select in the first case?

Thanks for your help.

like image 584
tumultous_rooster Avatar asked Nov 19 '14 21:11

tumultous_rooster


2 Answers

Update for dplyr >= 0.4.3.9000

In the dplyr development version 0.4.3.9000 (at time of writing), naming inside mutate_each and summarise_each has been simplified as noted in the News:

The naming behaviour of summarise_each() and mutate_each() has been tweaked so that you can force inclusion of both the function and the variable name: summarise_each(mtcars, funs(mean = mean), everything())

This is mainly important if you want to apply only 1 function inside mutate_each / summarise_each and you want to give those column new names.

To show the difference, here's the output from dplyr 0.4.3.9000 using the new naming functionality, in contrast to option a.2 below:

library(dplyr) # >= 0.4.3.9000 iris %>% mutate_each(funs(mysum = sum(.)), -Species) %>% head() #  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_mysum Sepal.Width_mysum #1          5.1         3.5          1.4         0.2  setosa              876.5             458.6 #2          4.9         3.0          1.4         0.2  setosa              876.5             458.6 #3          4.7         3.2          1.3         0.2  setosa              876.5             458.6 #4          4.6         3.1          1.5         0.2  setosa              876.5             458.6 #5          5.0         3.6          1.4         0.2  setosa              876.5             458.6 #6          5.4         3.9          1.7         0.4  setosa              876.5             458.6 #  Petal.Length_mysum Petal.Width_mysum #1              563.7             179.9 #2              563.7             179.9 #3              563.7             179.9 #4              563.7             179.9 #5              563.7             179.9 #6              563.7             179.9 

If you don't supply new names and you only supply 1 function, dplyr will change the existing columns (as it did in previous versions):

iris %>% mutate_each(funs(sum), -Species) %>% head() #  Sepal.Length Sepal.Width Petal.Length Petal.Width Species #1        876.5       458.6        563.7       179.9  setosa #2        876.5       458.6        563.7       179.9  setosa #3        876.5       458.6        563.7       179.9  setosa #4        876.5       458.6        563.7       179.9  setosa #5        876.5       458.6        563.7       179.9  setosa #6        876.5       458.6        563.7       179.9  setosa 

I assume that this new functionality will be available via CRAN in the next release version 0.4.4.


dplyr verions <= 0.4.3:

How can I give these new columns appropriate names, like I can in mutate?

a) 1 function applied in mutate_each/summarise_each

If you apply only 1 function inside the mutate_each or summarise_each, the existing columns will be transformed and the names will be kept as they used to be, unless you supply a named vector to mutate_each_/summarise_each_ (see option a.4)

Here are some examples:

a.1 only 1 function -> will keep the existing names

iris %>% mutate_each(funs(sum), -Species) %>% head() #  Sepal.Length Sepal.Width Petal.Length Petal.Width Species #1          876         459          564         180  setosa #2          876         459          564         180  setosa #3          876         459          564         180  setosa #4          876         459          564         180  setosa #5          876         459          564         180  setosa #6          876         459          564         180  setosa 

a.2 also if you specify a new column name extension:

iris %>% mutate_each(funs(mysum = sum(.)), -Species) %>% head() #  Sepal.Length Sepal.Width Petal.Length Petal.Width Species #1          876         459          564         180  setosa #2          876         459          564         180  setosa #3          876         459          564         180  setosa #4          876         459          564         180  setosa #5          876         459          564         180  setosa #6          876         459          564         180  setosa 

a.3 Manually specify a new name per column (but only practical for few columns):

iris %>% mutate_each(funs(sum), SLsum = Sepal.Length,SWsum = Sepal.Width,  -Species) %>% head() #  Sepal.Length Sepal.Width Petal.Length Petal.Width Species SLsum SWsum #1          5.1         3.5          1.4         0.2  setosa   876   459 #2          4.9         3.0          1.4         0.2  setosa   876   459 #3          4.7         3.2          1.3         0.2  setosa   876   459 #4          4.6         3.1          1.5         0.2  setosa   876   459 #5          5.0         3.6          1.4         0.2  setosa   876   459 #6          5.4         3.9          1.7         0.4  setosa   876   459 

a.4 Use a named vector to create additional columns with new names:

case 1: keep original columns

In contrast to options a.1, a.2 and a.3, dplyr will keep the existing columns unchanged and create new columns in this approach. The names of the new columns equal the names of the named vector you create in advance (vars in this case).

vars <- names(iris)[1:2]  # choose which columns should be mutated vars <- setNames(vars, paste0(vars, "_sum")) # create new column names iris %>% mutate_each_(funs(sum), vars) %>% head  #  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_sum Sepal.Width_sum #1          5.1         3.5          1.4         0.2  setosa            876.5           458.6 #2          4.9         3.0          1.4         0.2  setosa            876.5           458.6 #3          4.7         3.2          1.3         0.2  setosa            876.5           458.6 #4          4.6         3.1          1.5         0.2  setosa            876.5           458.6 #5          5.0         3.6          1.4         0.2  setosa            876.5           458.6 #6          5.4         3.9          1.7         0.4  setosa            876.5           458.6 

case 2: remove original columns

As you can see, this approach keeps the existing columns unchanged and adds new columns with specified names. In case you don't want to keep the original columns, but just the newly created columns (and the other columns) you can just add a select statement afterwards:

iris %>% mutate_each_(funs(sum), vars) %>% select(-one_of(vars)) %>% head #  Petal.Length Petal.Width Species Sepal.Length_sum Sepal.Width_sum #1          1.4         0.2  setosa            876.5           458.6 #2          1.4         0.2  setosa            876.5           458.6 #3          1.3         0.2  setosa            876.5           458.6 #4          1.5         0.2  setosa            876.5           458.6 #5          1.4         0.2  setosa            876.5           458.6 #6          1.7         0.4  setosa            876.5           458.6 

b) more than 1 function applied in mutate_each/summarise_each

b.1 Let dplyr figure out new names

If you applied more than 1 function, you can let dplyr figure out names by itself (and it will keep the existing columns):

iris %>% mutate_each(funs(sum, mean), -Species) %>% head() #  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_sum Sepal.Width_sum Petal.Length_sum #1          5.1         3.5          1.4         0.2  setosa              876             459              564 #2          4.9         3.0          1.4         0.2  setosa              876             459              564 #3          4.7         3.2          1.3         0.2  setosa              876             459              564 #4          4.6         3.1          1.5         0.2  setosa              876             459              564 #5          5.0         3.6          1.4         0.2  setosa              876             459              564 #6          5.4         3.9          1.7         0.4  setosa              876             459              564 #  Petal.Width_sum Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean #1             180              5.84             3.06              3.76              1.2 #2             180              5.84             3.06              3.76              1.2 #3             180              5.84             3.06              3.76              1.2 #4             180              5.84             3.06              3.76              1.2 #5             180              5.84             3.06              3.76              1.2 #6             180              5.84             3.06              3.76              1.2 

b.2 Manually specify new column names

Another option, when using more than 1 function, is to specify the column name extension on your own:

iris %>% mutate_each(funs(MySum = sum(.), MyMean = mean(.)), -Species) %>% head() #  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_MySum Sepal.Width_MySum Petal.Length_MySum #1          5.1         3.5          1.4         0.2  setosa                876               459                564 #2          4.9         3.0          1.4         0.2  setosa                876               459                564 #3          4.7         3.2          1.3         0.2  setosa                876               459                564 #4          4.6         3.1          1.5         0.2  setosa                876               459                564 #5          5.0         3.6          1.4         0.2  setosa                876               459                564 #6          5.4         3.9          1.7         0.4  setosa                876               459                564 #  Petal.Width_MySum Sepal.Length_MyMean Sepal.Width_MyMean Petal.Length_MyMean Petal.Width_MyMean #1               180                5.84               3.06                3.76                1.2 #2               180                5.84               3.06                3.76                1.2 #3               180                5.84               3.06                3.76                1.2 #4               180                5.84               3.06                3.76                1.2 #5               180                5.84               3.06                3.76                1.2 #6               180                5.84               3.06                3.76                1.2 

How can I select certain columns that I wish to mutate, like I did with select in the first case?

You can do that by referencing the columns to be mutated (or left out) by giving their names like here (mutate Sepal.Length, but not Species):

iris %>% mutate_each(funs(sum), Sepal.Length, -Species) %>% head() 

In addition, you can use special functions to select columns to be mutated, all columns that start with or contain a certain word etc by using for example:

iris %>% mutate_each(funs(sum), contains("Sepal"),  -Species) %>% head() 

For more information of those functions, see ?mutate_each and ?select.

Edit 1 after comment:

If you want to use standard evaluation, dplyr supplies SE-versions of most functions ending with an addtional "_". So in this case you would use:

x <- c("Sepal.Width", "Sepal.Length") # vector of column names  iris %>% mutate_each_(funs(sum), x) %>% head() 

Notice the mutate_each_ I used here.


Edit 2: updated with option a.4

like image 127
talat Avatar answered Sep 24 '22 04:09

talat


mutate_each will be deprecated, consider using mutate_at. From dplyr_0.5.0 documentation:

In the future mutate_each() and summarise_each() will be deprecated in favour of a more featureful family of functions: mutate_all(), mutate_at(), mutate_if(), summarise_all(), summarise_at() and summarise_if().

Apply a function to all variables except Species:

Warning: '.cols' param is deprecated, see note at the bottom!

iris %>% mutate_at(.cols=vars(-Species), .funs=funs(mysum = sum(.))) %>% head()   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_mysum Sepal.Width_mysum 1          5.1         3.5          1.4         0.2  setosa              876.5             458.6 2          4.9         3.0          1.4         0.2  setosa              876.5             458.6 3          4.7         3.2          1.3         0.2  setosa              876.5             458.6 4          4.6         3.1          1.5         0.2  setosa              876.5             458.6 5          5.0         3.6          1.4         0.2  setosa              876.5             458.6 6          5.4         3.9          1.7         0.4  setosa              876.5             458.6   Petal.Length_mysum Petal.Width_mysum 1              563.7             179.9 2              563.7             179.9 3              563.7             179.9 4              563.7             179.9 5              563.7             179.9 6              563.7             179.9 

Apply a function to a subset of variables

vars_to_process=c("Petal.Length","Petal.Width") iris %>% mutate_at(.cols=vars_to_process, .funs=funs(mysum = sum(.))) %>% head()   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Length_mysum Petal.Width_mysum 1          5.1         3.5          1.4         0.2  setosa              563.7             179.9 2          4.9         3.0          1.4         0.2  setosa              563.7             179.9 3          4.7         3.2          1.3         0.2  setosa              563.7             179.9 4          4.6         3.1          1.5         0.2  setosa              563.7             179.9 5          5.0         3.6          1.4         0.2  setosa              563.7             179.9 6          5.4         3.9          1.7         0.4  setosa              563.7             179.9 

Update! for dplyr 0.7.1 version (2017-08-08)

If you see the message:

.cols has been renamed and is deprecated, please use .vars

then change .cols by .vars.

iris %>% mutate_at(.vars=vars(-Species), .funs=funs(mysum = sum(.))) %>% head() 

Another example:

iris %>% mutate_at(.vars=vars(Sepal.Width), .funs=funs(mysum = sum(.))) %>% head() 

Is equivalent to:

iris %>% mutate_at(.vars=vars("Sepal.Width"), .funs=funs(mysum = sum(.))) %>% head() 

Also, in this version the mutate_each is deprecated:

mutate_each() is deprecated. Use mutate_all(), mutate_at() or mutate_if() instead. To map funs over a selection of variables, use mutate_at()

like image 27
Pablo Casas Avatar answered Sep 22 '22 04:09

Pablo Casas