I can't come up with a solution to a problem I've had when trying to create my own data.frame
and run a quantitative analysis (such as a chisq.test
) on it.
The backdrop is as follows: I've summarized data I received relating to two hospitals. Both measured the same categorical variable n number of times. In this case it's how frequently health-care associated bacteria were found during a specific observation period.
In a table, the summarized data looks as follows, where % indicates the percentage of all measurements made during the time period.
n Hospital 1 (%) n Hospital 2 (%)
Healthcare associated bacteria 829 (59.4) 578 (57.6)
Community associated bacteria 473 (33.9) 372 (37.1)
Contaminants 94 (6.7) 53 (5.3)
Total 1396 (100.0) 1003 (100.0)
Now looking at the percentages, it's evident that the proportions are very similar and you may wonder why on earth I want to statistically compare the two hospitals. But I have other data, where the proportions are different and so the aims of this question is:
How to compare Hospital 1 to Hospital 2 with regards to the categories measured.
As the data is provided in a summarized fashion and in an array format, I decided to make a data.frame
for each of the single variables/categories.
hosp1 <- rep(c("Yes", "No"), times=c(829,567))
hosp2 <- rep(c("Yes", "No"), times=c(578,425))
all <- cbind(hosp1, c(hosp2,rep(NA, length(hosp1)-length(hosp2))))
all <- data.frame(all)
names(all)[2]<-"hosp2"
summary(all)
So far so good, because the summary seems to look right to be able to now run a chisq.test()
. But now, things get weird.
with(all, chisq.test(hosp1, hosp2, correct=F))
Pearson's Chi-squared test
data: hosp1 and hosp2
X-squared = 286.3087, df = 1, p-value < 2.2e-16
The results, seem to indicate that there's a significant difference. If you crosstabulate the data, you see that R is summarizing it in a very strange way:
with(all, table(hosp1, hosp2))
No Yes
No 174 0
Yes 251 578
So of course if the data is summarized in that way, there'll be a statistically significant finding - because one category is summarized as having no measurments at all. Why on earth is this happening and what can I do to correct it? Finally, instead of having to make a separate data.frame
for each category, is there anyway evident function to loop it? I can't come up with one.
Thanks for your help!
UPDATED BASED ON THELATEMAIL'S REQUEST FOR RAW DATA.FRAME
dput(SO_Example_v1)
structure(list(Type = structure(c(3L, 1L, 2L), .Label = c("Community",
"Contaminant", "Healthcare"), class = "factor"), hosp1_WoundAssocType = c(464L,
285L, 24L), hosp1_BloodAssocType = c(73L, 40L, 26L), hosp1_UrineAssocType = c(75L,
37L, 18L), hosp1_RespAssocType = c(137L, 77L, 2L), hosp1_CathAssocType = c(80L,
34L, 24L), hosp2_WoundAssocType = c(171L, 115L, 17L), hosp2_BloodAssocType = c(127L,
62L, 12L), hosp2_UrineAssocType = c(50L, 29L, 6L), hosp2_RespAssocType = c(135L,
142L, 6L), hosp2_CathAssocType = c(95L, 24L, 12L)), .Names = c("Type",
"hosp1_WoundAssocType", "hosp1_BloodAssocType", "hosp1_UrineAssocType",
"hosp1_RespAssocType", "hosp1_CathAssocType", "hosp2_WoundAssocType",
"hosp2_BloodAssocType", "hosp2_UrineAssocType", "hosp2_RespAssocType",
"hosp2_CathAssocType"), class = "data.frame", row.names = c(NA,
-3L))
Explanation: This data.frame
is actually more complicated than what is summarized in the table above, as it also contains where the specific types of bacteria where cultured (i.e. in wounds, blood cultures, catheters etc.). So the table that I'm making actually looks as follows:
All locations
n Hospital 1 (%) n Hospital 2 (%) p-val
Healthcare associated bacteria 829 (59.4) 578 (57.6) 0.39
Community associated bacteria 473 (33.9) 372 (37.1) ...
Contaminants 94 (6.7) 53 (5.3) ...
Total 1396 (100.0) 1003 (100.0) -
Where the heading "All locations", will then subsequently be replaced by wound, blood, urine, catheter etc.
The answer to the question about how to get the p-values to work is somewhat simple. You can get the other two p-values that you're looking for using the same syntax as @thelatemail used as follows:
#community (p = 0.1049)
chisq.test(cbind(c(473,923),c(372,631)),correct=FALSE)
#contaminants (p = 0.1443)
chisq.test(cbind(c(94,1302),c(53,950)),correct=FALSE)
You can get these answers more programmatically as follows:
out <- cbind(rowSums(SO_Example_v1[,2:6]),rowSums(SO_Example_v1[,7:11]))
chisq.test(rbind(out[1,],colSums(out[2:3,])),correct=FALSE)
chisq.test(rbind(out[2,],colSums(out[c(1,3),])),correct=FALSE)
chisq.test(rbind(out[3,],colSums(out[1:2,])),correct=FALSE)
Of course, we're getting beyond the scope of SO at this point, but perhaps a more pertinent question given the nature of the data is if there's a difference between the hospitals overall, which you can answer (from a frequentist perspective) using a chi-square test based on all three types:
chisq.test(out,correct=FALSE)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With