Can anyone suggest when to use map() (and the other map_*() functions) and when to use summarise_at() / mutate_at()?
E.g. if we are modifying columns that are plain vectors, do we not need to think about map()?
If we have a df with a column that has a list in it, do we then need to use map()?
Does the map() function always need to be used with the nest() function?
Could anyone suggest some learning videos on this? And also how to put lists in a df, model multiple lists at the same time, and then store the model results in another column?
Thank you so much!
The biggest difference between {dplyr} and {purrr} is that {dplyr} is designed to work on data.frames only, while {purrr} is designed to work on every kind of list. Since a data.frame is a list of columns, you can also use {purrr} to iterate over a data.frame, one column at a time:
map_chr(iris, class)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
"numeric" "numeric" "numeric" "numeric" "factor"
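To make the "data.frames are lists" point concrete, here is a minimal sketch. The name `col_means` is only illustrative; it shows that the typed map_*() variants walk over the columns in the same way map_chr() does above.

```r
library(purrr)

# A data.frame is a list of equal-length columns
is.list(iris)

# So map_*() visits one column at a time; here, the mean of each numeric column
col_means <- map_dbl(iris[1:4], mean)
col_means
```

The result is a named numeric vector, one element per column, rather than a data.frame.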
summarise_at and map_at do not behave exactly the same: summarise_at returns just the summary you're looking for, whereas map_at returns the whole data.frame as a list, with the modification applied only where you asked for it:
> library(purrr)
> library(dplyr)
> small_iris <- sample_n(iris, 5)
> map_at(small_iris, c("Sepal.Length", "Sepal.Width"), mean)
$Sepal.Length
[1] 6.58
$Sepal.Width
[1] 3.2
$Petal.Length
[1] 6.7 1.3 5.7 4.3 4.7
$Petal.Width
[1] 2.0 0.4 2.1 1.3 1.5
$Species
[1] virginica setosa virginica versicolor versicolor
Levels: setosa versicolor virginica
> summarise_at(small_iris, c("Sepal.Length", "Sepal.Width"), mean)
Sepal.Length Sepal.Width
1 6.58 3.2
map_at always returns a list, mutate_at always returns a data.frame:
> map_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
$Sepal.Length
[1] 0.77 0.54 0.67 0.64 0.67
$Sepal.Width
[1] 0.28 0.39 0.33 0.29 0.31
$Petal.Length
[1] 6.7 1.3 5.7 4.3 4.7
$Petal.Width
[1] 2.0 0.4 2.1 1.3 1.5
$Species
[1] virginica setosa virginica versicolor versicolor
Levels: setosa versicolor virginica
> mutate_at(small_iris, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 0.77 0.28 6.7 2.0 virginica
2 0.54 0.39 1.3 0.4 setosa
3 0.67 0.33 5.7 2.1 virginica
4 0.64 0.29 4.3 1.3 versicolor
5 0.67 0.31 4.7 1.5 versicolor
So to sum up on your first question: if you want to do an operation "column-wise" on a non-nested df and want a data.frame as the result, you should go for {dplyr}.
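As a side note, in {dplyr} 1.0.0 and later the _at() variants are superseded by across(). A minimal sketch of the equivalent calls, assuming such a version is installed; `first_rows` is just a deterministic stand-in for the sampled small_iris, so its numbers differ from the output above:

```r
library(dplyr)

# Deterministic stand-in for the randomly sampled small_iris
first_rows <- head(iris, 5)

# Equivalent of summarise_at(first_rows, c("Sepal.Length", "Sepal.Width"), mean)
col_means <- summarise(first_rows, across(c(Sepal.Length, Sepal.Width), mean))

# Equivalent of mutate_at(first_rows, c("Sepal.Length", "Sepal.Width"), ~ .x / 10)
scaled <- mutate(first_rows, across(c(Sepal.Length, Sepal.Width), ~ .x / 10))
```

Both results are still data.frames, so the conclusion above is unchanged.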
Regarding nested columns, you have to combine group_by(), nest() from {tidyr}, mutate() and map(). What you're doing here is creating a smaller version of your data.frame that contains a column which is a list of data.frames. Then you use map() to iterate over the elements of this new column.
Here is an example with our beloved iris:
library(tidyr)
iris_n <- iris %>%
  group_by(Species) %>%
  nest()
iris_n
# A tibble: 3 x 2
Species data
<fct> <list>
1 setosa <tibble [50 × 4]>
2 versicolor <tibble [50 × 4]>
3 virginica <tibble [50 × 4]>
Here, the new object is a data.frame whose column data is a list of smaller data.frames, one per Species (the factor we specified in group_by()). Then we can iterate over this column by simply doing:
map(iris_n$data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x))
[[1]]
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)
Coefficients:
(Intercept) Sepal.Width
2.6390 0.6905
[[2]]
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)
Coefficients:
(Intercept) Sepal.Width
3.5397 0.8651
[[3]]
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = .x)
Coefficients:
(Intercept) Sepal.Width
3.9068 0.9015
But the idea is to keep everything inside a data.frame, so we can use mutate() to create a column that holds this new list of lm results:
iris_n %>%
  mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)))
# A tibble: 3 x 3
Species data lm
<fct> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
So you can chain several steps inside mutate() to get, for example, the r.squared:
iris_n %>%
  mutate(lm = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)),
         lm = map(lm, summary),
         r_squared = map_dbl(lm, "r.squared"))
# A tibble: 3 x 4
Species data lm r_squared
<fct> <list> <list> <dbl>
1 setosa <tibble [50 × 4]> <S3: summary.lm> 0.551
2 versicolor <tibble [50 × 4]> <S3: summary.lm> 0.277
3 virginica <tibble [50 × 4]> <S3: summary.lm> 0.209
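The same pattern works for any value you can pull out of a fitted model. As a minimal sketch, here the slope of each regression is extracted with coef(). The column names `model` and `slope` and the object `iris_models` are illustrative choices, not part of the example above:

```r
library(dplyr)
library(tidyr)
library(purrr)

iris_models <- iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(
    # One lm object per nested data.frame
    model = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)),
    # Pull the Sepal.Width coefficient out of each model as a plain number
    slope = map_dbl(model, ~ coef(.x)[["Sepal.Width"]])
  )
```

The slope column should match the Sepal.Width coefficients printed by the earlier map() call.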
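But a more concise way is to use compose() from {purrr} to build one function that does all the steps in a single call, instead of repeating them inside mutate(). Note that compose() applies its functions from right to left by default, so lm() runs first, then summary(), then the "r.squared" extractor.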
get_rsquared <- compose(as_mapper("r.squared"), summary, lm)
iris_n %>%
  mutate(lm = map_dbl(data, ~ get_rsquared(Sepal.Length ~ Sepal.Width, data = .x)))
# A tibble: 3 x 3
Species data lm
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 0.551
2 versicolor <tibble [50 × 4]> 0.277
3 virginica <tibble [50 × 4]> 0.209
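The order of the arguments to compose() matters, which is why lm comes last in get_rsquared. A small sketch with two toy functions (`add_one` and `times_two` are hypothetical helpers made up for this illustration), assuming {purrr} 0.3.0 or later for the .dir argument:

```r
library(purrr)

# Hypothetical toy functions, only here to show the order of application
add_one <- function(x) x + 1
times_two <- function(x) x * 2

# Default direction is "backward": the last function listed runs first
backward <- compose(add_one, times_two)                  # add_one(times_two(x))
forward <- compose(add_one, times_two, .dir = "forward") # times_two(add_one(x))

backward(3)  # (3 * 2) + 1 = 7
forward(3)   # (3 + 1) * 2 = 8
```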
If you know you'll always be using Sepal.Length ~ Sepal.Width, you can even prefill lm() with partial():
pr_lm <- partial(lm, formula = Sepal.Length ~ Sepal.Width)
get_rsquared <- compose(as_mapper("r.squared"), summary, pr_lm)
iris_n %>%
  mutate(lm = map_dbl(data, get_rsquared))
# A tibble: 3 x 3
Species data lm
<fct> <list> <dbl>
1 setosa <tibble [50 × 4]> 0.551
2 versicolor <tibble [50 × 4]> 0.277
3 virginica <tibble [50 × 4]> 0.209
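Coming back to the question of whether map() always needs nest(): it does not. nest() is just one convenient way to get a list column, and map() works on any list. Here is a minimal sketch with a list column built by hand; the tibble `measurements` and its values are made up for illustration:

```r
library(dplyr)
library(purrr)

# Hypothetical data: each row holds a numeric vector of a different length
measurements <- tibble(
  id = c("a", "b", "c"),
  values = list(c(1, 2, 3), c(10, 20), 5)
)

# map_*() iterates over the list column, one element per row
measurements <- measurements %>%
  mutate(
    n = map_int(values, length),
    avg = map_dbl(values, mean)
  )
```

This also shows one way to put lists in a df: pass a list() as a column when building a tibble.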
Regarding resources, I've written a series of blog posts on {purrr} that you can check: https://colinfay.me/tags/#purrr