I have a large dataframe of doctor visit records. I want to select only those rows in which at least one of the 11 diagnosis codes listed is found in a specified set of diagnosis codes that I am interested in.
The dataframe is 18 columns by 39,019 rows. I am interested in diagnosis codes in columns 6:16. Here is a data sample for these 11 diagnosis columns only (to protect identifiable info):
diag1 diag2 diag3 diag4 diag5 diag6 diag7 diag8 diag9 diag10 diag11
786 272 401 782 250 91912 530 NA NA NA NA
845 530 338 311 NA NA NA NA NA NA NA
Here is the code I have tried to use:
mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451, 460:466, 480:486, 490:493, 496, 786)
y = apply(dt[,paste("diag", 1:11, sep="")], 1, function(x) sum((any(x !=NA %in% mydiag))))
y = as.data.frame(y)
As you can see, in the 2 example rows I provided, I would want to keep the first row but throw out the second row because it doesn't have any of the codes I want. The code sample I provided doesn't work: I get a vector of 39,019 "1" values. So I'm guessing the apply statement is being read as a logical somehow, and yet I know for a fact that not all of the rows have a code of interest, so I would have expected a mix of 1s and 0s.
Is there a better way to do this row selection task?
I think you're overcomplicating things with the !=NA bit in there. Since NA doesn't appear in mydiag, you can drop it completely. So your apply statement can then become:
goodRows <- apply(dat, 1, function(x) any(x %in% mydiag))
dat[goodRows,]
#---------------
diag1 diag2 diag3 diag4 diag5 diag6 diag7 diag8 diag9 diag10 diag11
1 786 272 401 782 250 91912 530 NA NA NA NA
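For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the two sample rows from the question. It assumes dat holds only the 11 diagnosis columns, which the answer above implies but does not state:

```r
# Codes of interest, copied from the question
mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451,
            460:466, 480:486, 490:493, 496, 786)

# Assumption: dat contains only the diagnosis columns (the two sample rows)
dat <- as.data.frame(rbind(
  c(786, 272, 401, 782, 250, 91912, 530, NA, NA, NA, NA),
  c(845, 530, 338, 311, NA, NA, NA, NA, NA, NA, NA)
))
names(dat) <- paste0("diag", 1:11)

# TRUE for each row that contains at least one code of interest
goodRows <- apply(dat, 1, function(x) any(x %in% mydiag))

# Keep only the matching rows
dat[goodRows, ]
```

Only the first row should survive, since it contains 786 and 401, while none of the codes in the second row are in mydiag.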
The problem comes from your function, function(x) sum((any(x !=NA %in% mydiag))). Comparing with NA via x != NA never tests for missingness, because any comparison with NA yields NA; use !is.na(x) instead, keeping in mind that it returns a logical vector. On top of that, %in% binds more tightly than !=, so the expression is evaluated as x != (NA %in% mydiag), i.e. x != FALSE, which is TRUE for every nonzero code. any() is therefore TRUE for every row, and that is why you get all 1s. I'm guessing you just want to take the values that aren't NA and check whether any of those are in mydiag.
x[!is.na(x)] %in% mydiag
would work much better for that. But you don't even need to filter out the NAs: NA isn't in your vector, so any element of x that is NA returns FALSE for x %in% mydiag.
function(x){any(x %in% mydiag)}
is a nice way to get a logical value telling you whether the row meets your criteria.
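To see why the original expression gave 1 for every row, it may help to evaluate its pieces on a single row. The sketch below uses a shortened mydiag and a made-up row x purely for illustration, and relies on R's documented operator precedence (see ?Syntax), where %in% is evaluated before !=:

```r
mydiag <- c(401, 786)        # shortened code set, for illustration only
x <- c(845, 530, 338, NA)    # hypothetical row with no code of interest

# NA is not an element of mydiag, so this is FALSE
na_in_set <- NA %in% mydiag

# Parsed as x != (NA %in% mydiag), i.e. x != FALSE:
# TRUE for nonzero codes, NA for the missing one
compared <- x != NA %in% mydiag

# any() sees at least one TRUE, so the original function returns 1
original_result <- sum(any(compared))

# The intended test: is any code in this row a code of interest?
intended_result <- any(x %in% mydiag)
```

Here original_result is 1 even though the row has no code of interest, while intended_result is FALSE.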
# Logical vector: TRUE for each row that has at least one code of interest
id = apply(dt[,paste("diag", 1:11, sep="")], 1, function(x){any(x %in% mydiag)})
# Just grab those rows
y <- dt[id, ]
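If apply() turns out to be slow on the full data, one possible alternative is to test each diagnosis column as a whole and combine the results with a logical OR. This is only a sketch: the visit column and the two-row dt are made up for illustration, and on 39,019 rows the speed difference may not matter.

```r
mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451,
            460:466, 480:486, 490:493, 496, 786)
diag_cols <- paste0("diag", 1:11)

# Hypothetical stand-in for the real data: one extra column plus the
# two sample rows from the question
dt <- data.frame(
  visit = c("v1", "v2"),
  rbind(c(786, 272, 401, 782, 250, 91912, 530, NA, NA, NA, NA),
        c(845, 530, 338, 311, NA, NA, NA, NA, NA, NA, NA))
)
names(dt) <- c("visit", diag_cols)

# One logical vector per diagnosis column, OR-ed together element-wise
keep <- Reduce(`|`, lapply(dt[diag_cols], function(col) col %in% mydiag))

# Rows with at least one code of interest, all original columns retained
y <- dt[keep, ]
```

Because %in% never returns NA, keep contains only TRUE or FALSE and can be used directly for row indexing.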