I'm new to purrr, Hadley's promising functional programming R library. I'm trying to take a grouped and split dataframe and run a t-test on a variable. An example using a sample dataset might look like this.
mtcars %>%
  dplyr::select(cyl, mpg) %>%
  group_by(as.character(cyl)) %>%
  split(.$cyl) %>%
  map(~ t.test(.$`4`$mpg, .$`6`$mpg))
This results in the following error:
Error in var(x) : 'x' is NULL
In addition: Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In mean.default(x) : argument is not numeric or logical: returning NA
Am I just misunderstanding how map works? Or is there a better way to think about this?
I don't fully understand the expected result, but this might be a starting point for an answer. map() from purrr uses .x in the formula argument.
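To see why the original call fails, it can help to look at what map() hands to the formula. This is only a small sketch; the names by_cyl, missing_4 and n_obs are made up for illustration.

```r
library(purrr)

# split() returns a named list with one data frame per cylinder count
by_cyl <- split(mtcars, mtcars$cyl)

# Inside the formula, .x is a single data frame (one group), not the whole
# list, so it has no element called "4" and the lookup returns NULL
missing_4 <- map_lgl(by_cyl, ~ is.null(.x[["4"]]))

# Each element does carry its own mpg column
n_obs <- map_int(by_cyl, ~ length(.x$mpg))
```

Since the lookup of `4` and `6` yields NULL for every element, t.test() receives NULL inputs, which is consistent with the "'x' is NULL" error in the question.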
Here is one way to accomplish what I think you are trying to do with just purrr.
mtcars %>%
  split(as.character(.$cyl)) %>%
  map(~ t.test(.x$mpg))
But purrr::by_slice() pairs nicely with dplyr::group_by() (by_slice() has since moved from purrr to the purrrlyr package).
library(purrr)
library(dplyr)
mtcars %>%
  dplyr::select(cyl, mpg) %>%
  group_by(as.character(cyl)) %>%
  by_slice(~ t.test(.x$mpg))
Or, you could skip purrr entirely using dplyr::summarise().
library(dplyr)
mtcars %>%
  dplyr::select(cyl, mpg) %>%
  group_by(as.character(cyl)) %>%
  # refer to mpg directly: .$mpg would be the full, ungrouped column
  summarise(t_test = list(t.test(mpg)))
If the nested data.frame is confusing, broom can help us get an easy data.frame summary of the results.
purrr + broom + tidyr
library(broom)
library(tidyr)
mtcars %>%
  group_by(as.character(cyl)) %>%
  by_slice(~ tidy(t.test(.x$mpg))) %>%
  unnest()
dplyr + broom
library(broom)
mtcars %>%
  dplyr::select(cyl, mpg) %>%
  group_by(as.character(cyl)) %>%
  do(tidy(t.test(.$mpg)))
Edited to include response to comment
With pipes, we can get carried away quite quickly. I think Walt did a nice job in his answer, but I wanted to make sure that I provided a purrr-ty answer. I hope the use of pipeR is not overly confusing.
library(purrr)
library(dplyr)
library(broom)
library(tidyr)
library(pipeR)
mtcars %>>%
  (split(., .$cyl)) %>>%
  (split_cyl ~
    names(split_cyl) %>>%
      (
        cross_d(
          list(against = ., tested = .),
          .filter = `==`
        )
      ) %>>%
      by_row(
        ~ tidy(t.test(split_cyl[[.x$tested]]$mpg, split_cyl[[.x$against]]$mpg))
      )
  ) %>>%
  unnest()
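For readers who find the nested pipeR chain hard to follow, the same idea (every ordered pair of cylinder levels, excluding self-comparisons) can be sketched with base R's expand.grid() and purrr::map2_dbl(). This is an assumed alternative, not the code from the answer above, and it keeps only the p-value rather than the full tidy() output.

```r
library(purrr)

split_cyl <- split(mtcars, mtcars$cyl)

# All ordered pairs of cylinder levels, dropping self-comparisons
pairs <- expand.grid(
  tested = names(split_cyl),
  against = names(split_cyl),
  stringsAsFactors = FALSE
)
pairs <- pairs[pairs$tested != pairs$against, ]

# One two-sample t-test per row; keep only the p-value for brevity
pairs$p.value <- map2_dbl(
  pairs$tested, pairs$against,
  ~ t.test(split_cyl[[.x]]$mpg, split_cyl[[.y]]$mpg)$p.value
)
```

Each unordered pair appears twice here (once in each direction), which mirrors what the `==` filter in the chain above leaves behind.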
Especially when dealing with pipes that require multiple inputs (we don't have Haskell's Arrows here), I find it easier to reason by types/signatures first, then encapsulate logic in functions (which you can unit test), then write a concise chain.
In this case you want to compare all possible pairs of vectors, so I would set a goal of writing a function that takes a pair (i.e. a list of 2) of vectors and returns the 2-way t.test of them.
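Once you've done this, you just need some glue. So the plan is:
1. Write a function that takes a list of two vectors and performs the two-sample t-test.
2. Write a function/pipe that fetches the two vectors from mtcars for a given pair.
3. Map that over the list of all pairs.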
It's important to have this plan before writing any code. Things are somewhat obscured by the fact that R is not strongly typed, but this way you reason about "types" first, implementation second.
t.test takes dots, so we use purrr::lift to have it take a list. Since we don't want to match on the names of the elements of the list, we use .unnamed = TRUE. Also, we make it extra clear we're using the t.test function with an arity of 2 (though this extra step is not needed for the code to work).
t.test2 <- function(x, y) t.test(x, y)
liftedTT <- lift(t.test2, .unnamed = TRUE)
Wrap the function we got in step 1 into a functional chain that takes a simple pair (here I use indexes, it should be easy to use cyl factor levels, but I don't have time to figure it out).
doTT <- function(pair) {
  mtcars %>%
    split(as.character(.$cyl)) %>%
    map(~ select(., mpg)) %>%
    # magrittr::extract is `[`; namespaced so tidyr::extract cannot mask it
    magrittr::extract(pair) %>%
    liftedTT %>%
    broom::tidy
}
Now that we have all our lego pieces ready, composition is trivial.
1:length(unique(mtcars$cyl)) %>%
  combn(2) %>%
  as.data.frame %>%
  as.list %>%
  map(~ doTT(.))
$V1
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
1 6.920779 26.66364 19.74286 4.719059 0.0004048495 12.95598 3.751376 10.09018
$V2
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
1 11.56364 26.66364 15.1 7.596664 1.641348e-06 14.96675 8.318518 14.80876
$V3
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
1 4.642857 19.74286 15.1 5.291135 4.540355e-05 18.50248 2.802925 6.482789
There's quite a bit here to clean up, mainly using factor levels and preserving them in the output (and not using globals in the second function) but I think the core of what you wanted is here. The trick not to get lost, in my experience, is to work from the inside out.
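As a rough sketch of that cleanup (level names instead of indexes, labels preserved in the output, no reliance on a global inside the helper), something like the following might work. The helper tt_pair and the "_vs_" labels are hypothetical names chosen for this example, and the mpg column is still hard-coded.

```r
library(purrr)

# Hypothetical helper: compare mpg between two named groups of a split list
tt_pair <- function(groups, pair) {
  t.test(groups[[pair[1]]]$mpg, groups[[pair[2]]]$mpg)
}

groups <- split(mtcars, mtcars$cyl)

# All unordered pairs of level names
pairs <- combn(names(groups), 2, simplify = FALSE)

# Label each pair so the level names survive in the output
pairs <- set_names(pairs, map_chr(pairs, ~ paste(.x, collapse = "_vs_")))

results <- map(pairs, ~ tt_pair(groups, .x))
p_values <- map_dbl(results, ~ .x$p.value)
```

Passing groups in as an argument keeps tt_pair testable on its own, which fits the "reason about signatures first" approach described above.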
To perform the two-sample t-tests, you have to create the combinations of the numbers of cylinders. I don't see that you can create the combinations using purrr functions. However, a way which uses only purrr and base R functions is
library(purrr)
t_test2 <- mtcars %>%
  split(.$cyl) %>%
  transpose() %>%
  .[["mpg"]] %>%
  (function(x) combn(names(x), m = 2,
    function(y) t.test(flatten_dbl(x[y[1]]), flatten_dbl(x[y[2]])),
    simplify = FALSE))
although this does seem a bit contrived.
A similar approach which uses only base R functions with chaining is
t_test <- mtcars %>%
  split(.$cyl) %>%
  (function(x) combn(names(x), m = 2, function(y) x[y], simplify = FALSE)) %>%
  lapply(function(x) t.test(x[[1]]$mpg, x[[2]]$mpg))