I have a large file with the first column being IDs, and the remaining 1304 columns being genotypes like below.
rsID sample1 sample2 sample3...sample1304
abcd aa bb nc nc
efgh nc nc nc nc
ijkl aa ab aa nc
I would like to count the number of "nc" values per row and output the result of that to another column so that I get the following:
rsID sample1 sample2 sample3...sample1304 no_calls
abcd aa bb nc nc 2
efgh nc nc nc nc 4
ijkl aa ab aa nc 1
The table function counts frequencies per column, not per row, and if I transpose the data to use it with table, I would need the file to look like this:
abcd aa[sample1]
abcd bb[sample2]
abcd nc[sample3] ...
abcd nc[sample1304]
efgh nc[sample1]
efgh nc[sample2]
efgh nc[sample3] ...
efgh nc[sample1304]
With this format, I would get the following which is what I want:
ID nc aa ab bb
abcd 2 1 0 1
efgh 4 0 0 0
Does anybody have any idea of a simple way to get frequencies by row? I am trying this right now, but it is taking quite some time to run:
rsids$Number_of_no_calls <- apply(rsids, 1, function(x) sum(x=="NC"))
You can use rowSums:
df$no_calls <- rowSums(df == "nc")
df
# rsID sample1 sample2 sample3 sample1304 no_calls
#1 abcd aa bb nc nc 2
#2 efgh nc nc nc nc 4
#3 ijkl aa ab aa nc 1
Or, as pointed out by MrFlick, to exclude the first column (the IDs) from the row sums, you can slightly modify the approach:
df$no_calls <- rowSums(df[-1] == "nc")
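For anyone who wants to try this without the original file, here is a minimal self-contained sketch. The toy data frame is an assumption that mirrors the three rows shown in the question, with only four sample columns instead of 1304.

```r
# Toy data mirroring the question's layout (assumed; the real file has 1304 sample columns)
df <- data.frame(
  rsID       = c("abcd", "efgh", "ijkl"),
  sample1    = c("aa", "nc", "aa"),
  sample2    = c("bb", "nc", "ab"),
  sample3    = c("nc", "nc", "aa"),
  sample1304 = c("nc", "nc", "nc"),
  stringsAsFactors = FALSE
)

# df[-1] drops the ID column; the comparison yields a logical matrix,
# and rowSums counts the TRUE values in each row
df$no_calls <- rowSums(df[-1] == "nc")
df$no_calls
# 2 4 1
```

The comparison is case-sensitive, so the string has to match the data exactly ("nc" here, not "NC").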
Regarding the row names: they are not counted in rowSums, and you can make a simple test to demonstrate it:
rownames(df)[1] <- "nc" # name first row "nc"
rowSums(df == "nc") # compute the row sums
#nc  2  3
# 2  4  1   # still the same in first row
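If you also want the full per-row frequency table from the question (counts of nc, aa, ab and bb for each ID), the same idea can be repeated once per genotype. This is only a sketch: the toy data and the `genotypes` vector are assumptions based on the values shown in the question, so adjust them to whatever codes your file actually contains.

```r
# Toy data mirroring the question's layout (assumed; the real file has 1304 sample columns)
df <- data.frame(
  rsID       = c("abcd", "efgh", "ijkl"),
  sample1    = c("aa", "nc", "aa"),
  sample2    = c("bb", "nc", "ab"),
  sample3    = c("nc", "nc", "aa"),
  sample1304 = c("nc", "nc", "nc"),
  stringsAsFactors = FALSE
)

# Genotype codes to count (assumed from the question's example)
genotypes <- c("nc", "aa", "ab", "bb")

# Work on a character matrix of the genotype columns only
geno <- as.matrix(df[-1])

# One rowSums call per genotype; sapply binds the results into columns
counts <- sapply(genotypes, function(g) rowSums(geno == g))
rownames(counts) <- df$rsID
counts
#      nc aa ab bb
# abcd  2  1  0  1
# efgh  4  0  0  0
# ijkl  1  2  1  0
```

Note that with a single-row data frame sapply would simplify to a plain vector rather than a matrix, so this sketch assumes at least two rows.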