
Determining the number of non-zero cells and calculating the prevalence by a stratifying variable

Tags: r

I have spent a good deal of time looking around and cannot find a solution to my specific question. I would really appreciate any help.

I have a large data.frame (1258 obs. of 298 variables) in which each row is a participant's sample record and each column is a specific bacterial genus found in that sample. There are multiple records for each participant, and this is also indicated by a column in the data frame.

Here is an example of what the data frame can look like.

Corynebacterium <- c(0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.5, 0.7, 0.1, 0.0)
Paenibacillus <- c(0.0, 0.1, 0.7, 0.3, 0.5, 0.7, 0.0, 0.0, 0.0, 0.3, 0.3, 0.0)
Psychrobacter <- c(0.1, 0.1, 0.5, 0.0, 0.0, 0.0, 0.3, 0.6, 0.0, 0.6, 0.7, 0.0)
Staphylocccus <- c(0.5, 0.0, 0.3, 0.0, 0.3, 0.2, 0.5, 0.0, 0.4, 0.1, 0.1, 0.5)
TimePoint <- c("A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F")
SampleDF <- data.frame(Corynebacterium, Paenibacillus, Psychrobacter,
                       Staphylocccus, TimePoint)

For each genus, I would like to know the proportion of non-zero cells at a given time point, i.e. the number of non-zero cells divided by the total number of cells.

For example: for Corynebacterium at TimePoint A, it would be #NonZeroCells/Total#Cells = 1/2 = 0.5. A different way of considering this is 50% of the cells for Corynebacterium at TimePoint A are non-zero.
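
To make the target number concrete, here is a minimal sketch of that single calculation done by hand. It only uses the two vectors from the example above; the names cells_A, n_nonzero, n_total and prevalence_A are made up for illustration.

```r
# Sample data from the example, reduced to the two vectors needed here
Corynebacterium <- c(0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.5, 0.7, 0.1, 0.0)
TimePoint <- c("A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F")

# Values of Corynebacterium recorded at TimePoint A (illustrative name)
cells_A <- Corynebacterium[TimePoint == "A"]

n_nonzero <- sum(cells_A != 0)  # number of non-zero cells
n_total   <- length(cells_A)    # total number of cells

prevalence_A <- n_nonzero / n_total
prevalence_A
# [1] 0.5
```

What I am after is this same ratio for every genus at every time point.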

EpiBlake asked Feb 10 '23 15:02


2 Answers

Here's a dplyr answer:

library(dplyr)

# across() requires dplyr >= 1.0.0
SampleDF %>%
    group_by(TimePoint) %>%
    summarise(across(everything(), ~ sum(.x != 0) / length(.x)))

#   TimePoint Corynebacterium Paenibacillus Psychrobacter Staphylocccus
# 1         A             0.5           0.0           1.0           1.0
# 2         B             0.5           0.5           1.0           0.0
# 3         C             0.5           0.5           0.5           1.0
# 4         D             0.5           1.0           0.5           0.5
# 5         E             0.5           1.0           0.5           1.0
# 6         F             0.0           0.5           0.0           1.0

You could also do this very simply in base R:

aggregate(. ~ TimePoint, data=SampleDF, function(x) sum(x != 0) / length(x))

Matthew Plourde answered Feb 13 '23 07:02


Personally, I prefer to avoid external packages when I can. If you feel the same way, a good way to do something like this is to use the built-in aggregate() along with a few simple custom functions.

What aggregate() does is break a data frame into a bunch of smaller ones based on some grouping variable, and then pass each column of each piece to a function of your choice. You can use built-in functions like sum, or you can write your own.
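
As a rough illustration of that description, the same two steps can be written out by hand with split() and sapply(). This is only a conceptual sketch, not how aggregate() is implemented internally; the names pieces, by_hand and with_aggregate are made up for the example, and only two of the genus columns are used to keep it short.

```r
SampleDF <- data.frame(
    Corynebacterium = c(0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.5, 0.7, 0.1, 0.0),
    Paenibacillus   = c(0.0, 0.1, 0.7, 0.3, 0.5, 0.7, 0.0, 0.0, 0.0, 0.3, 0.3, 0.0),
    TimePoint       = c("A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F")
)
genus_cols <- c("Corynebacterium", "Paenibacillus")

# Step 1: break the data frame into one smaller data frame per TimePoint
pieces <- split(SampleDF[genus_cols], SampleDF$TimePoint)

# Step 2: pass each column of each piece to a function (here the built-in sum)
# sapply() returns genera in rows and groups in columns, so transpose it
by_hand <- t(sapply(pieces, function(piece) sapply(piece, sum)))

# aggregate() performs both steps in a single call
with_aggregate <- aggregate(x = SampleDF[genus_cols],
                            by = list(SampleDF$TimePoint),
                            FUN = sum)

by_hand
with_aggregate
```

Both objects should hold the same per-group sums; aggregate() simply returns them as a data frame with the grouping values in the Group.1 column.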

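In your case, you want to find the proportion of non-zero values within each group. Here are two simple examples: the first counts the non-zero values, and the second divides that count by the number of values in the group.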

func.simple_count <- function(data.vector) {

    return(sum(data.vector!=0))
}
aggregate(x = SampleDF[c("Corynebacterium","Paenibacillus","Psychrobacter","Staphylocccus")],
          by = list(SampleDF$TimePoint),
          FUN = func.simple_count)

Output:

  Group.1 Corynebacterium Paenibacillus Psychrobacter Staphylocccus
1       A               1             0             2             2
2       B               1             1             2             0
3       C               1             1             1             2
4       D               1             2             1             1
5       E               1             2             1             2
6       F               0             1             0             2

func.percent_nonzero <- function(data.vector) {

    return(sum(data.vector!=0)/length(data.vector))
}
aggregate(x = SampleDF[c("Corynebacterium","Paenibacillus","Psychrobacter","Staphylocccus")],
          by = list(SampleDF$TimePoint),
          FUN = func.percent_nonzero)

Output:

  Group.1 Corynebacterium Paenibacillus Psychrobacter Staphylocccus
1       A             0.5           0.0           1.0           1.0
2       B             0.5           0.5           1.0           0.0
3       C             0.5           0.5           0.5           1.0
4       D             0.5           1.0           0.5           0.5
5       E             0.5           1.0           0.5           1.0
6       F             0.0           0.5           0.0           1.0

On a larger data frame, rather than listing the variables explicitly in the aggregate() call as I did, you could use the names() function with != to exclude just the grouping variable.
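
A minimal sketch of that idea on the sample data follows. The helper name value_cols is made up for illustration, and it assumes TimePoint is the only column that is not a genus.

```r
SampleDF <- data.frame(
    Corynebacterium = c(0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.5, 0.7, 0.1, 0.0),
    Paenibacillus   = c(0.0, 0.1, 0.7, 0.3, 0.5, 0.7, 0.0, 0.0, 0.0, 0.3, 0.3, 0.0),
    Psychrobacter   = c(0.1, 0.1, 0.5, 0.0, 0.0, 0.0, 0.3, 0.6, 0.0, 0.6, 0.7, 0.0),
    Staphylocccus   = c(0.5, 0.0, 0.3, 0.0, 0.3, 0.2, 0.5, 0.0, 0.4, 0.1, 0.1, 0.5),
    TimePoint       = c("A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F")
)

func.percent_nonzero <- function(data.vector) {
    return(sum(data.vector != 0) / length(data.vector))
}

# Every column name except the grouping variable (illustrative helper name)
value_cols <- names(SampleDF)[names(SampleDF) != "TimePoint"]

# Naming the list element labels the grouping column TimePoint instead of Group.1
result <- aggregate(x = SampleDF[value_cols],
                    by = list(TimePoint = SampleDF$TimePoint),
                    FUN = func.percent_nonzero)
result
```

If the full data frame has other non-genus columns, such as a participant identifier, they would need to be excluded as well, for example with !(names(SampleDF) %in% c(...)) in place of !=.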

TARehman answered Feb 13 '23 08:02