 

R: data.table count !NA per row

Tags:

r

data.table

I am trying to count the number of columns that do not contain NA for each row, and place that value into a new column for that row.

Example data:

library(data.table)

a = c(1,2,3,4,NA)
b = c(6,NA,8,9,10)
c = c(11,12,NA,14,15)
d = data.table(a,b,c)

> d 
    a  b  c
1:  1  6 11
2:  2 NA 12
3:  3  8 NA
4:  4  9 14
5: NA 10 15

My desired output would include a new column num_obs which contains the number of non-NA entries per row:

    a  b  c num_obs
1:  1  6 11       3
2:  2 NA 12       2
3:  3  8 NA       2
4:  4  9 14       3
5: NA 10 15       2

I've been reading for hours now and so far the best I've come up with is looping over rows, which I know is never advisable in R or data.table. I'm sure there is a better way to do this, please enlighten me.

My crappy way:

len = (1:NROW(d))
for (n in len) {
  d[n, num_obs := length(which(!is.na(d[n])))]
}
asked Feb 10 '16 by Reilstein

People also ask

How do I count NA in a row in R?

1. Count the Number of NA's per Row with rowSums() The first method to find the number of NA's per row in R uses the power of the functions is.na() and rowSums(). Both the is.na() function and the rowSums() function are R base functions.
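As a sketch, applying the rowSums() approach described above to the question's example data (same d as defined in the question) would look like this:

```r
library(data.table)

a = c(1,2,3,4,NA)
b = c(6,NA,8,9,10)
c = c(11,12,NA,14,15)
d = data.table(a,b,c)

# !is.na(d) produces a logical matrix (TRUE = observed value);
# rowSums() then counts the TRUEs in each row
d[, num_obs := rowSums(!is.na(d))]
```

This yields the desired num_obs column of 3, 2, 2, 3, 2.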

How do I count the number of NA values in a column in R?

The easiest way to count the number of NA's in R in a single column is by using the functions sum() and is.na(). The is.na() function takes one column as input and converts all the missing values into ones and all other values into zeros.

How do I find the number of missing values in R?

R automatically converts logical vectors to integer vectors when using arithmetic functions. In the process TRUE gets turned to 1 and FALSE gets converted to 0 . Thus, sum(is.na(x)) gives you the total number of missing values in x .
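For example, counting missing values in a single vector:

```r
x <- c(1, NA, 3, NA, 5)

# is.na(x) is c(FALSE, TRUE, FALSE, TRUE, FALSE);
# sum() coerces the logicals to 0/1 and adds them up
sum(is.na(x))   # 2
```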


1 Answer

Try this one using Reduce to chain together + calls:

d[, num_obs := Reduce(`+`, lapply(.SD,function(x) !is.na(x)))]

If speed is critical, you can eke out a touch more with Ananda's suggestion to hardcode the number of columns being assessed (the 4 below matches the larger benchmarking table; the 3-column example d would use 3):

d[, num_obs := 4 - Reduce("+", lapply(.SD, is.na))]
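As a sketch, here is the Reduce approach run against the question's 3-column example d; .SDcols restricts .SD to the original columns so the hardcoded variant (3 here, since d has three columns) stays correct even after num_obs is added:

```r
library(data.table)
d = data.table(a = c(1,2,3,4,NA),
               b = c(6,NA,8,9,10),
               c = c(11,12,NA,14,15))

# Reduce(`+`, ...) adds the three logical !is.na() columns element-wise
d[, num_obs := Reduce(`+`, lapply(.SD, function(x) !is.na(x))),
  .SDcols = c("a", "b", "c")]

# Hardcoded variant: total columns minus the per-row NA count
d[, num_obs := 3 - Reduce(`+`, lapply(.SD, is.na)),
  .SDcols = c("a", "b", "c")]
```

Both assignments produce num_obs of 3, 2, 2, 3, 2, matching the desired output in the question.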

Benchmarking using Ananda's suggested larger data.table d:

fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
fun3 <- function(indt) indt[, num_obs := Reduce(`+`, lapply(.SD,function(x) !is.na(x)))][]
fun4 <- function(indt) indt[, num_obs := 4 - Reduce("+", lapply(.SD, is.na))][]

library(microbenchmark)
microbenchmark(fun1(copy(d)), fun3(copy(d)), fun4(copy(d)), times=10L)

#Unit: milliseconds
#          expr      min       lq     mean   median       uq      max neval
# fun1(copy(d)) 3.565866 3.639361 3.912554 3.703091 4.023724 4.596130    10
# fun3(copy(d)) 2.543878 2.611745 2.973861 2.664550 3.657239 4.011475    10
# fun4(copy(d)) 2.265786 2.293927 2.798597 2.345242 3.385437 4.128339    10
answered Sep 19 '22 by thelatemail