I am trying to count the number of columns that do not contain NA for each row, and place that value into a new column for that row.
Example data:
library(data.table)
a = c(1,2,3,4,NA)
b = c(6,NA,8,9,10)
c = c(11,12,NA,14,15)
d = data.table(a,b,c)
> d
    a  b  c
1:  1  6 11
2:  2 NA 12
3:  3  8 NA
4:  4  9 14
5: NA 10 15
My desired output would include a new column, num_obs, which contains the number of non-NA entries per row:
    a  b  c num_obs
1:  1  6 11       3
2:  2 NA 12       2
3:  3  8 NA       2
4:  4  9 14       3
5: NA 10 15       2
I've been reading for hours now, and so far the best I've come up with is looping over rows, which I know is never advisable in R or data.table. I'm sure there is a better way to do this; please enlighten me.
My crappy way:
len = (1:NROW(d))
for (n in len) {
  d[n, num_obs := length(which(!is.na(d[n])))]
}
1. Count the Number of NA's per Row with rowSums()
The first method to find the number of NA's per row in R combines is.na() and rowSums(), both of which are base R functions.
The easiest way to count the number of NA's in a single column is to combine sum() and is.na(). The is.na() function takes a vector (such as one column) and returns a logical vector that is TRUE for each missing value and FALSE otherwise.
R automatically converts logical vectors to integer vectors when using arithmetic functions. In the process, TRUE is turned into 1 and FALSE into 0. Thus, sum(is.na(x)) gives you the total number of missing values in x.
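As a small illustration of the logical-to-integer conversion described above, here is a base R sketch. The vector x and the data frame m are made-up stand-ins (m mirrors the question's example values), not objects from the original posts.

```r
# Hypothetical vector with two missing values.
x <- c(1, NA, 3, NA)

na_in_x  <- sum(is.na(x))    # TRUE counts as 1, so this is the number of NAs
obs_in_x <- sum(!is.na(x))   # negating the mask counts the non-missing values

# The same idea applied row-wise to a data frame that mirrors the example data.
m <- data.frame(a = c(1, 2, 3, 4, NA),
                b = c(6, NA, 8, 9, 10),
                c = c(11, 12, NA, 14, 15))

na_per_row  <- rowSums(is.na(m))    # missing values in each row
obs_per_row <- rowSums(!is.na(m))   # non-missing values in each row

print(obs_per_row)
```

Since the question asks for non-NA counts, the negated form rowSums(!is.na(m)) is the one that corresponds to num_obs.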
Try this one using Reduce to chain together + calls:
d[, num_obs := Reduce(`+`, lapply(.SD,function(x) !is.na(x)))]
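To make the Reduce step less opaque, here is a sketch that performs the same computation outside of the data.table call, on the example data from the question. The intermediate names present and counts are only for illustration.

```r
library(data.table)

d <- data.table(a = c(1, 2, 3, 4, NA),
                b = c(6, NA, 8, 9, 10),
                c = c(11, 12, NA, 14, 15))

# One logical vector per column: TRUE where the value is present.
present <- lapply(d, function(x) !is.na(x))

# Reduce(`+`, list(p1, p2, p3)) evaluates (p1 + p2) + p3 element-wise,
# so each position ends up holding that row's count of non-NA values.
counts <- Reduce(`+`, present)

print(counts)
```

Because the addition is done on whole columns at once, no per-row loop is needed, and no intermediate matrix is built the way rowSums(!is.na(d)) would build one.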
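If speed is critical, you can eke out a touch more with Ananda's suggestion to hardcode the number of columns being assessed (the 4 below presumably matches the column count of the benchmark table; the three-column example d would need 3):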
d[, num_obs := 4 - Reduce("+", lapply(.SD, is.na))]
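If hardcoding the column count feels fragile, one possible variant is to take it from .SD itself. This is a sketch on the question's three-column example, not part of the benchmark that follows; the cols vector is an assumption introduced here so that an existing num_obs column is never counted on a re-run.

```r
library(data.table)

d <- data.table(a = c(1, 2, 3, 4, NA),
                b = c(6, NA, 8, 9, 10),
                c = c(11, 12, NA, 14, 15))

# Columns to assess; restricting .SD keeps num_obs out of its own calculation.
cols <- c("a", "b", "c")

# length(.SD) is the number of assessed columns, so subtracting the per-row
# NA count leaves the per-row count of non-NA values.
d[, num_obs := length(.SD) - Reduce(`+`, lapply(.SD, is.na)), .SDcols = cols]

print(d)
```

The trade-off is a small amount of extra work per call in exchange for not having to update a literal when the set of columns changes.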
Benchmarking using Ananda's larger data.table d from above:
fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
fun3 <- function(indt) indt[, num_obs := Reduce(`+`, lapply(.SD,function(x) !is.na(x)))][]
fun4 <- function(indt) indt[, num_obs := 4 - Reduce("+", lapply(.SD, is.na))][]
library(microbenchmark)
microbenchmark(fun1(copy(d)), fun3(copy(d)), fun4(copy(d)), times=10L)
#Unit: milliseconds
#          expr      min       lq     mean   median       uq      max neval
# fun1(copy(d)) 3.565866 3.639361 3.912554 3.703091 4.023724 4.596130    10
# fun3(copy(d)) 2.543878 2.611745 2.973861 2.664550 3.657239 4.011475    10
# fun4(copy(d)) 2.265786 2.293927 2.798597 2.345242 3.385437 4.128339    10