Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get `chisq.test()$p.value` for several groups using `dplyr::group_by()`

I'm trying to conduct a chi square test on several groups within the dplyr frame. The problem is, group_by() %>% summarise() doesn't seem to do trick.

Simulated data (same structure as problematic data, but random, so p.values should be high)

set.seed(1)
data.frame(partido=sample(c("PRI", "PAN"), 100, 0.6),
       genero=sample(c("H", "M"), 100, 0.7), 
       GM=sample(c("Bajo", "Muy bajo"), 100, 0.8)) -> foo

I want to compare several groups defined by GM to see if there are changes in the p.values for the crosstab of partido and genero, conditional to GM.

The obvious dplyr way should be:

foo %>% 
  group_by(GM) %>% 
  summarise(pvalue=chisq.test(.$partido, .$genero)$p.value)  #just the p.value, so summarise is happy

But I get the p.values for the ungrouped data, just to times, not the p.value for each table:

# A tibble: 2 × 2 GM pvalue <fctr> <dbl> 1 Bajo 0.8660521 2 Muy bajo 0.8660521

Testing each group using filter I get:

foo %>% 
  filter(GM=="Bajo") %$% 
  table(partido, genero) %>% 
  chisq.test()

Returns: X-squared = 0.015655, df = 1, p-value = 0.9004

foo %>% 
  filter(GM=="Muy bajo") %$% 
  table(partido, genero) %>% chisq.test()

Returns: X-squared = 0.50409, df = 1, p-value = 0.4777

dplyr:summarise() works with functions with more than one argument, so this shouldn't be the problem:

data.frame(a=1:10, b=10:1, c=sample(c("Grupo 1", "Grupo 2"), 10, 0.5)) %>% 
    group_by(c) %>% 
    summarise(r=cor(a, b))

works like charm. It just doesn't seem to work with chisq.test.

I managed to get what I wanted with nested models using tidyr::nest() and purrr::map(), but I find the code cumbersome --at least for my students. Actually, I´ve invested many ours teaching them (a very math and programming challenged group) dplyr so they could avoid vector functions as much as possible.

foo %>% 
  nest(-GM) %>% 
  mutate(tabla=map(data, ~table(.))) %>% 
  mutate(pvalue=map(tabla, ~chisq.test(.)$p.value)) %>% 
  select(GM, pvalue) %>% 
  unnest()

A tibble: 2 × 2
       GM   pvalue
    <fctr>  <dbl>
1     Bajo  0.9004276
2 Muy bajo  0.4777095

do() does the trick too:

foo %>% 
  group_by(GM) %>% 
  do(tidy(chisq.test(.$partido, .$genero)))

Source: local data frame [2 x 5]
Groups: GM [2]
    GM statistic   p.value parameter
<fctr>     <dbl>     <dbl>     <int>
1     Bajo 0.0156553 0.9004276         1
2 Muy bajo 0.5040878 0.4777095         1
# ... with 1 more variables: method <fctr>

as in: Fisher's and Pearson's test for indepedence

But, ¿why doesn't group_by() work with summarise(chisq.test()$p.value)?

like image 859
mpaladino Avatar asked Jan 04 '23 07:01

mpaladino


1 Answers

In dplyr you can generally just use unquoted variable names to access the relevant columns, whether you're in a groupby or otherwise. So removing the .$ accessors from .$partido and .$genero which are not needed I get:

foo %>% 
    group_by(GM) %>% 
    summarise(pvalue= chisq.test(partido, genero)$p.value) 

# A tibble: 2 × 2
        GM    pvalue
    <fctr>     <dbl>
1     Bajo 0.9004276
2 Muy bajo 0.4777095
like image 181
Marius Avatar answered Jan 08 '23 06:01

Marius