I have spent a good deal of time looking around and cannot find a solution to my specific question. I would really appreciate any help.
I have a large data.frame (1258 obs. of 298 variables) in which each row is a sample record for a participant and each column is a specific bacterial genus found in that sample. Each participant has multiple records, which is also indicated in a column variable.
Here is an example of what the data frame can look like.
Corynebacterium <- c(0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.5, 0.7, 0.1, 0.0)
Paenibacillus <- c(0.0, 0.1, 0.7, 0.3, 0.5, 0.7, 0.0, 0.0, 0.0, 0.3, 0.3, 0.0)
Psychrobacter <- c(0.1, 0.1, 0.5, 0.0, 0.0, 0.0, 0.3, 0.6, 0.0, 0.6, 0.7, 0.0)
Staphylocccus <- c(0.5, 0.0, 0.3, 0.0, 0.3, 0.2, 0.5, 0.0, 0.4, 0.1, 0.1, 0.5)
TimePoint <- c("A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F")
SampleDF <- data.frame(Corynebacterium, Paenibacillus, Psychrobacter,
Staphylocccus, TimePoint)
I would like to know the number of non-zero cells over the total number of cells for a given timepoint.
For example: for Corynebacterium at TimePoint A, it would be #NonZeroCells/Total#Cells = 1/2 = 0.5. A different way of considering this is 50% of the cells for Corynebacterium at TimePoint A are non-zero.
Here's a dplyr answer:
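library(dplyr)
# across() (dplyr >= 1.0.0) replaces the deprecated summarise_each()/funs()
SampleDF %>%
  group_by(TimePoint) %>%
  summarise(across(everything(), ~ sum(.x != 0) / length(.x)))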
#   TimePoint Corynebacterium Paenibacillus Psychrobacter Staphylocccus
# 1         A             0.5           0.0           1.0           1.0
# 2         B             0.5           0.5           1.0           0.0
# 3         C             0.5           0.5           0.5           1.0
# 4         D             0.5           1.0           0.5           0.5
# 5         E             0.5           1.0           0.5           1.0
# 6         F             0.0           0.5           0.0           1.0
You could also do this very simply in base R:
aggregate(. ~ TimePoint, data=SampleDF, function(x) sum(x != 0) / length(x))
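As a side note, because `x != 0` is a logical vector and R counts TRUE as 1, `sum(x != 0) / length(x)` is the same thing as `mean(x != 0)`. A minimal sketch using the Corynebacterium values at TimePoint A from the question (the variable names and the `na.rm` variant are illustrative assumptions, not part of the answer above):

```r
# Corynebacterium values at TimePoint A, taken from the question's example data
x <- c(0.5, 0.0)

prop_sum  <- sum(x != 0) / length(x)  # the form used in the answer
prop_mean <- mean(x != 0)             # equivalent: mean of a logical vector

# With missing values, both forms return NA unless NAs are dropped explicitly
x_na    <- c(0.5, 0.0, NA)
prop_na <- mean(x_na != 0, na.rm = TRUE)  # proportion among non-missing values
```

Whether dropping NAs is appropriate depends on what a missing abundance means in your data, so treat the last line as an option rather than a recommendation.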
Personally, I prefer not to use external packages if I can avoid it. If you are like me, the best way to do something like this is to use the built-in aggregate() along with a few simple custom functions.
What aggregate does is break a data frame into a bunch of smaller ones based on some grouping variable, and then pass each column of each piece to a function of your choice. You can use built-in functions like sum, or you can write your own.
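To make that description concrete, here is a rough sketch of the same split-then-apply idea done by hand with split() and sapply(), next to the equivalent aggregate() call. The toy data frame reuses rows 1, 2, 7, and 8 of the question's data; the names `toy`, `pieces`, `by_hand`, and `via_aggregate` are made up for illustration, and this only mirrors the behaviour conceptually, not how aggregate() is implemented internally.

```r
# Toy data: two genera and two time points taken from the question's example
toy <- data.frame(
  Corynebacterium = c(0.5, 0.0, 0.0, 0.1),
  Psychrobacter   = c(0.1, 0.1, 0.3, 0.6),
  TimePoint       = c("A", "B", "A", "B")
)

# Step 1: split the value columns into one small data frame per group
pieces <- split(toy[c("Corynebacterium", "Psychrobacter")], toy$TimePoint)

# Step 2: apply the function to every column of every piece;
# t() puts the groups in rows and the genera in columns
by_hand <- t(sapply(pieces, function(piece) sapply(piece, sum)))

# aggregate() does both steps in one call and returns a data frame
via_aggregate <- aggregate(x = toy[c("Corynebacterium", "Psychrobacter")],
                           by = list(toy$TimePoint),
                           FUN = sum)
```

Both results hold the same per-group sums; aggregate() simply adds the grouping column (Group.1) and keeps everything in a data frame.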
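In your case, you want to find the proportion of non-zero values within each grouping. Here are two simple examples: the first counts the non-zero values per group, and the second divides that count by the group size.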
# Count the non-zero values in a vector
func.simple_count <- function(data.vector) {
  return(sum(data.vector != 0))
}
aggregate(x = SampleDF[c("Corynebacterium", "Paenibacillus", "Psychrobacter", "Staphylocccus")],
          by = list(SampleDF$TimePoint),
          FUN = func.simple_count)
Output:
  Group.1 Corynebacterium Paenibacillus Psychrobacter Staphylocccus
1       A               1             0             2             2
2       B               1             1             2             0
3       C               1             1             1             2
4       D               1             2             1             1
5       E               1             2             1             2
6       F               0             1             0             2
# Proportion of non-zero values in a vector
func.percent_nonzero <- function(data.vector) {
  return(sum(data.vector != 0) / length(data.vector))
}
aggregate(x = SampleDF[c("Corynebacterium", "Paenibacillus", "Psychrobacter", "Staphylocccus")],
          by = list(SampleDF$TimePoint),
          FUN = func.percent_nonzero)
Output:
  Group.1 Corynebacterium Paenibacillus Psychrobacter Staphylocccus
1       A             0.5           0.0           1.0           1.0
2       B             0.5           0.5           1.0           0.0
3       C             0.5           0.5           0.5           1.0
4       D             0.5           1.0           0.5           0.5
5       E             0.5           1.0           0.5           1.0
6       F             0.0           0.5           0.0           1.0
On a larger data frame, rather than explicitly listing the variables in the aggregate statement as I did, you could instead use the names() function with != to exclude just the grouping variable.
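A minimal sketch of that idea on the question's example data, assuming TimePoint is the only non-genus column (`value_cols` and `result` are names chosen here for illustration):

```r
# Example data from the question
SampleDF <- data.frame(
  Corynebacterium = c(0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.5, 0.7, 0.1, 0.0),
  Paenibacillus   = c(0.0, 0.1, 0.7, 0.3, 0.5, 0.7, 0.0, 0.0, 0.0, 0.3, 0.3, 0.0),
  Psychrobacter   = c(0.1, 0.1, 0.5, 0.0, 0.0, 0.0, 0.3, 0.6, 0.0, 0.6, 0.7, 0.0),
  Staphylocccus   = c(0.5, 0.0, 0.3, 0.0, 0.3, 0.2, 0.5, 0.0, 0.4, 0.1, 0.1, 0.5),
  TimePoint       = c("A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F")
)

# Every column name except the grouping variable
value_cols <- names(SampleDF)[names(SampleDF) != "TimePoint"]

# Naming the list element labels the group column TimePoint instead of Group.1
result <- aggregate(x = SampleDF[value_cols],
                    by = list(TimePoint = SampleDF$TimePoint),
                    FUN = function(v) sum(v != 0) / length(v))
```

If the real data has several non-genus columns (for example a participant ID), something like `!(names(SampleDF) %in% c("TimePoint", "ParticipantID"))` should work the same way; `ParticipantID` is a hypothetical column name here.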