Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Chi-squared test of independence on all combinations of columns in a dataframe in R

Tags:

r

chi-squared

this is my first time posting here and I hope this is all in the right place. I have been using R for basic statistical analysis for some time, but haven't really used it for anything computationally challenging and I'm very much a beginner in the programming/ data manipulation side of R.

I have presence/absence (binary) data on 72 plant species in 323 plots in a single catchment. The dataframe is 323 rows, each representing a plot, with 72 columns, each representing a species. This is a sample of the first 4 columns (some row numbers are missing because the 323 plots are a subset of a larger number of preassigned plots, not all of which were surveyed):

> head(plots[,1:4])
 Agrostis.canina Agrostis.capillaris Alchemilla.alpina Anthoxanthum.odoratum
1               1                   0                 0                     0
3               0                   0                 0                     0
4               0                   0                 0                     0
5               0                   0                 0                     0
6               0                   0                 0                     0
8               0                   0                 0                     0

I want to to determine whether any of the plant species in this catchment are associated with any others, and if so, whether that is a positive or negative association. To do this I want to perform a chi-squared test of independence on each combination of species. I need to create a 2x2 contingency table for each speciesxspecies comparison, run a chi-squared test on each of those contingency tables, and save the output. Ultimately I would like to end up with a list or matrix of all species by species tests that shows whether that combination of species has a positive, negative, or no significant association. I'd also like to incorporate some code that only shows an association as positive if all expected values were greater than 5.

I have made a start by writing the following function:

CHI <- function(sppx, sppy) 
{test <- chisq.test(table(sppx, sppy)) 
result <- c(test$statistic, test$p.value,
        sign((table(sppx, sppy) - test$expected)[2,2]))
return(result)
}

This returns the following:

> CHI(plots$Agrostis.canina, plots$Agrostis.capillaris)

X-squared                             
1.095869e-27  1.000000e+00 -1.000000e+00 
Warning message:
In chisq.test(chitbl) : Chi-squared approximation may be incorrect

Now I'm trying to figure out a way to apply this function to each speciesxspecies combination in the data frame. I essentially want R to take each column, apply the CHI function to that column and each other column in sequence, and so on through all the columns, subtracting each column from the dataframe as it is done so the same species pair is not tested twice. I have tried various methods trying to use "for" loops or "apply" functions, but have not been able to figure this out. I hope that is clear enough. Any help here would be much appreciated. I have tried looking for existing solutions to this specific problem online, but haven't been able to find any that really helped. If anyone could link me to an existing answer to this that would also be great.

like image 209
YJS Avatar asked May 23 '16 13:05

YJS


People also ask

Can you run a chi-square on multiple variables?

The Chi-Square Test of Independence can only compare categorical variables. It cannot make comparisons between continuous variables or between categorical and continuous variables.

Can chi-square be used for grouped data?

The test requires that the data first be grouped. The actual number of observations in each group is compared to the expected number of observations and the test statistic is calculated as a function of this difference.

Can chi-square can be used to compare data for more than two groups?

The critical value in the table is , and we conclude that there are no significant differences between males and females in their political affiliations. Chi-square can also be used with more than two categories.


2 Answers

You need the combn function to find all the combinations of the columns and then apply them to your function, something like this:

apply(combn(1:ncol(plots), 2), 2, function(ind) CHI(plots[, ind[1]], plots[, ind[2]]))
like image 62
Psidom Avatar answered Oct 09 '22 07:10

Psidom


I think you are looking for something like this. I used the iris dataset.

require(datasets)
ind<-combn(NCOL(iris),2)
lapply(1:NCOL(ind), function (i) CHI(iris[,ind[1,i]],iris[,ind[2,i]]))
like image 36
Jeonifer Avatar answered Oct 09 '22 06:10

Jeonifer