I have a large dataframe of doctor visit records. Each record (row) can have up to 11 diagnosis codes. I want to know how many non-NA diagnosis codes are in each row.
Here is a sample of the data:
diag1 diag2 diag3 diag4 diag5 diag6 diag7 diag8 diag9 diag10 diag11
  786   272   401   782   250 91912   530    NA    NA     NA     NA
  845   530   338   311    NA    NA    NA    NA    NA     NA     NA
So in these two rows, I would want to know that row 1 had 7 codes and row 2 had 4 codes. The dataframe has 31,596 rows, so a loop is taking way too long. I'd like to use an "apply" statement to speed things up:
z = apply(y[,paste("diag", 1:11, sep="")], 1, function(x)sum({any(x[!is.na(x)])}))
R just returns a vector of 1's with length equal to the number of rows in the dataset. I think something is wrong with my use of "any"? Does anyone have a good way to count the number of non-NA values across multiple columns? Thanks!
To find the sum of the non-missing values in an R data frame column, we can simply use the sum function with na.rm set to TRUE. For example, if we have a data frame called df that contains a column x with some missing values, the sum of the non-missing values is sum(df$x, na.rm = TRUE).
R automatically converts logical vectors to integer vectors when using arithmetic functions: TRUE becomes 1 and FALSE becomes 0. Thus, sum(is.na(x)) gives the total number of missing values in x, and sum(!is.na(x)) gives the number of non-missing values.
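For instance, a minimal sketch on a toy vector (the vector x here is just illustrative):

x <- c(786, 272, NA, 401, NA)
sum(is.na(x))    # 2 missing values
sum(!is.na(x))   # 3 non-missing values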
Your apply version returns all 1's because any(x[!is.na(x)]) collapses each row's non-NA values to a single TRUE (any nonzero number coerces to TRUE), and sum(TRUE) is 1. Just use is.na and rowSums:
z <- rowSums(!is.na(y[,paste("diag", 1:11, sep="")]))
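As a quick sanity check, here is a sketch that rebuilds the two sample rows from the question as a data frame named y (the name taken from the code above) and counts the non-NA codes per row; a corrected apply version is included for comparison:

# rebuild the two sample rows from the question
y <- data.frame(diag1 = c(786, 845), diag2  = c(272, 530),
                diag3 = c(401, 338), diag4  = c(782, 311),
                diag5 = c(250, NA),  diag6  = c(91912, NA),
                diag7 = c(530, NA),  diag8  = c(NA, NA),
                diag9 = c(NA, NA),   diag10 = c(NA, NA),
                diag11 = c(NA, NA))

# is.na() returns a logical matrix; rowSums() counts the TRUEs in each row
z <- rowSums(!is.na(y[, paste("diag", 1:11, sep = "")]))
z
# [1] 7 4

# equivalent (but slower) apply version, fixing the original attempt
z2 <- apply(y[, paste("diag", 1:11, sep = "")], 1, function(x) sum(!is.na(x)))

rowSums operates on the whole logical matrix in vectorized code, so it should be much faster than a loop or apply over 31,596 rows.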