
purrr map a t.test onto a split df

Tags: r, purrr

I'm new to purrr, Hadley's promising functional programming library for R. I'm trying to take a grouped and split data frame and run a t-test on a variable. An example using a sample dataset might look like this:

mtcars %>% 
  dplyr::select(cyl, mpg) %>% 
  group_by(as.character(cyl)) %>% 
  split(.$cyl) %>% 
  map(~ t.test(.$`4`$mpg, .$`6`$mpg))

This results in the following error:

Error in var(x) : 'x' is NULL
In addition: Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In mean.default(x) : argument is not numeric or logical: returning NA

Am I just misunderstanding how map works? Or is there a better way to think about this?

Samarth Bhaskar asked Feb 22 '16 16:02


3 Answers

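I don't fully understand the expected result, but this might be a starting point for an answer. map() from purrr calls the function once per list element, and inside a formula that element is available as .x (or .), so each call sees only one group's data frame rather than the whole split list.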

Here is one way to accomplish what I think you are trying to do with just purrr.

mtcars %>%
  split(as.character(.$cyl)) %>%
  map(~t.test(.x$mpg)) 
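
To make the cause of the original error concrete, here is a minimal sketch (the names by_cyl and one_sample are only illustrative). After split(), each list element is one group's data frame, so inside map() there is no `4` or `6` to index into and t.test() ends up receiving NULL.

```r
library(purrr)

by_cyl <- split(mtcars, mtcars$cyl)

# The split list is named by cyl value ...
names(by_cyl)

# ... but each element handed to map() is a single group's data frame,
# which has columns such as mpg and no column called "4".
is.null(by_cyl[["4"]]$`4`)   # TRUE: the NULL that t.test() complained about

# One-sample t-test per group, using .x for the current element.
one_sample <- map(by_cyl, ~ t.test(.x$mpg))

# Pull the estimated mean out of each test result.
map_dbl(one_sample, "estimate")
```

Comparing two groups therefore has to happen outside a plain map() over the split list, since no single element contains both groups.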

But purrr::by_slice() pairs nicely with dplyr::group_by(). (In later releases, by_slice() was moved out of purrr into the purrrlyr package.)

library(purrr)
library(dplyr)

mtcars %>% 
  dplyr::select(cyl, mpg) %>% 
  group_by(as.character(cyl)) %>%
  by_slice(~ t.test(.x$mpg))

Or you could skip purrr entirely using dplyr::summarise(), storing each test result in a list column.

library(dplyr)

mtcars %>% 
  dplyr::select(cyl, mpg) %>% 
  group_by(as.character(cyl)) %>%
  summarise(t_test = list(t.test(mpg)))

If the nested results are confusing, broom can help us get an easy data.frame summary of the results.

purrr + broom + tidyr

library(broom)
library(tidyr)
mtcars %>%
  group_by(as.character(cyl)) %>%
  by_slice(~tidy(t.test(.x$mpg))) %>%
  unnest()

dplyr + broom

library(broom)

mtcars %>% 
  dplyr::select(cyl, mpg) %>% 
  group_by(as.character(cyl)) %>%
  do(tidy(t.test(.$mpg)))
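
In current dplyr, do() is marked as superseded. A hedged sketch of the same per-group summary using dplyr::group_modify(), which exists in recent dplyr versions, might look like this (per_group is just an illustrative name):

```r
library(dplyr)
library(broom)

# group_modify() calls the function once per group. Inside the formula,
# .x is that group's rows (without the grouping column), and the function
# must return a data frame, which tidy() does for a t-test result.
per_group <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ tidy(t.test(.x$mpg))) %>%
  ungroup()

per_group
```

The result has one row per cyl value, with the grouping column kept next to the columns produced by tidy().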

Edited to include response to comment

With pipes, we can get carried away quite quickly. I think Walt did a nice job in his answer, but I wanted to make sure that I provided a purrr-ty answer. I hope the use of pipeR is not overly confusing.

library(purrr)
library(dplyr)
library(broom)
library(tidyr)
library(pipeR)

mtcars %>>%
  (split(.,.$cyl)) %>>%
  (split_cyl~
    names(split_cyl) %>>%
     (
       cross_d(
         list(against=.,tested=.),
         .filter = `==`
       )
     ) %>>%
     by_row(
       ~tidy(t.test(split_cyl[[.x$tested]]$mpg,split_cyl[[.x$against]]$mpg))
     )
  ) %>>%
  unnest()

timelyportfolio answered Nov 13 '22 01:11


Especially when dealing with pipes that require multiple inputs (we don't have Haskell's Arrows here), I find it easier to reason by types/signatures first, then encapsulate logic in functions (which you can unit test), then write a concise chain.

In this case you want to compare all possible pairs of vectors, so I would set a goal of writing a function that takes a pair (i.e. a list of 2) of vectors and returns the two-sample t-test of them.

Once you've done this, you just need some glue. So the plan is:

  1. Write a function that takes a list of two vectors and performs the two-sample t-test.
  2. Write a function/pipe that fetches the vectors from mtcars (easy).
  3. Map the above over the list of pairs.

It's important to have this plan before writing any code. Things are somewhat obscured by the fact that R is not statically typed, but this way you reason about "types" first, implementation second.

Step 1

t.test takes dots, so we use purrr::lift to have it take a list. Since we don't want to match on the names of the elements of the list, we use .unnamed = TRUE. We also make it extra clear that we're using t.test with an arity of 2 (though this extra step is not needed for the code to work).

t.test2 <- function(x, y) t.test(x, y)
liftedTT <- lift(t.test2, .unnamed = TRUE)
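
Because step 1 is a plain function from a pair to a test result, it can be checked in isolation before any piping. The sketch below is one way to do that with base R only. pair_t_test is a hypothetical stand-in for liftedTT that takes the list directly; recent purrr releases deprecate the lift() family, so this also avoids that dependency.

```r
# Hypothetical stand-in for liftedTT: takes a list of two numeric vectors
# and returns the two-sample (Welch) t-test between them.
pair_t_test <- function(pair) {
  stopifnot(is.list(pair), length(pair) == 2)
  t.test(pair[[1]], pair[[2]])
}

mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)
res <- pair_t_test(mpg_by_cyl[c("4", "6")])

# Quick checks in the spirit of a unit test: the result is a test object
# and its two estimates are the group means we passed in.
stopifnot(inherits(res, "htest"))
stopifnot(isTRUE(all.equal(
  unname(res$estimate),
  c(mean(mpg_by_cyl[["4"]]), mean(mpg_by_cyl[["6"]]))
)))
```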

Step 2

Wrap the function we got in step 1 into a functional chain that takes a simple pair (here I use indexes; it should be easy to use cyl factor levels, but I don't have time to figure it out).

doTT <- function(pair) {
  mtcars %>%
    split(as.character(.$cyl)) %>%
    map(~ .x$mpg) %>%              # keep only the mpg vector of each group
    magrittr::extract(pair) %>%    # pick the two groups by position
    liftedTT() %>% 
    broom::tidy()
}

Step 3

Now that we have all our lego pieces ready, composition is trivial.

1:length(unique(mtcars$cyl)) %>% 
  combn(2) %>% 
  as.data.frame %>% 
  as.list %>% 
  map(~ doTT(.))

$V1
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 6.920779  26.66364  19.74286  4.719059 0.0004048495  12.95598 3.751376  10.09018

$V2
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 11.56364  26.66364      15.1  7.596664 1.641348e-06  14.96675 8.318518  14.80876

$V3
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 4.642857  19.74286      15.1  5.291135 4.540355e-05  18.50248 2.802925  6.482789

There's quite a bit here to clean up, mainly using factor levels and preserving them in the output (and not using globals in the second function) but I think the core of what you wanted is here. The trick not to get lost, in my experience, is to work from the inside out.
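
As a rough sketch of that clean-up (not the code used above), the pairs can be built from the cyl levels themselves and the data passed in as arguments, so the labels survive in the output and the helper no longer reads mtcars as a global. compare_pairs is a hypothetical helper name, and broom is left out to keep the example to base R.

```r
# Hypothetical helper: Welch t-tests between every pair of groups.
# values: numeric vector; groups: group labels of the same length.
compare_pairs <- function(values, groups) {
  by_group <- split(values, groups)
  # All unordered pairs of group labels, e.g. c("4", "6").
  pairs <- combn(names(by_group), 2, simplify = FALSE)
  rows <- lapply(pairs, function(p) {
    tt <- t.test(by_group[[p[1]]], by_group[[p[2]]])
    data.frame(
      group1    = p[1],
      group2    = p[2],
      estimate  = unname(tt$estimate[1] - tt$estimate[2]),  # difference in means
      statistic = unname(tt$statistic),
      p.value   = tt$p.value,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

compare_pairs(mtcars$mpg, mtcars$cyl)
```

Each row is labelled with the two cyl values being compared, which replaces the anonymous $V1, $V2, $V3 names.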

Roberto answered Nov 13 '22 00:11


To perform the two-sample t-tests, you have to create the combinations of the numbers of cylinders. I don't see a way to create the combinations using purrr functions. However, one way that uses only purrr and base R functions is

library(purrr)
t_test2 <- mtcars %>%
  split(.$cyl) %>%
  transpose() %>%
  .[["mpg"]] %>%
  (function(x) combn(names(x), m = 2,
                     function(y) t.test(flatten_dbl(x[y[1]]), flatten_dbl(x[y[2]])),
                     simplify = FALSE))

although this does seem a bit contrived.

A similar approach, which uses only base R functions chained with the pipe, is

t_test <- mtcars %>%
  split(.$cyl) %>%
  (function(x) combn(names(x), m = 2, function(y) x[y], simplify = FALSE)) %>%
  lapply(function(x) t.test(x[[1]]$mpg, x[[2]]$mpg))
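
If only the p-values are needed, base R's stats::pairwise.t.test() runs the same set of comparisons in one call. This is a different tool from the combn() approach, shown here as a hedged sketch: pool.sd = FALSE asks for separate two-sample (Welch) tests as in t.test(x, y), and p.adjust.method = "none" switches off the multiple-comparison correction that the function applies by default (Holm).

```r
# Pairwise t-tests of mpg between cyl groups, without pooling the variance
# and without adjusting the p-values, so the results are comparable to
# calling t.test() on each pair directly.
pw <- pairwise.t.test(mtcars$mpg, mtcars$cyl,
                      pool.sd = FALSE, p.adjust.method = "none")

# Lower-triangular matrix of p-values: rows "6" and "8", columns "4" and "6".
pw$p.value
```

Leaving p.adjust.method at its default is usually the safer choice when several pairs are tested at once.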

WaltS answered Nov 13 '22 01:11