
R: Selecting rows from a dataframe based on a set of values of interest appearing in certain columns

Tags: r, apply, rows

I have a large dataframe of doctor visit records. I want to select only those rows in which at least one of the 11 diagnosis codes listed is found in a specified set of diagnosis codes that I am interested in.

The dataframe is 18 columns by 39,019 rows. I am interested in diagnosis codes in columns 6:16. Here is a data sample for these 11 diagnosis columns only (to protect identifiable info):

diag1 diag2 diag3 diag4 diag5 diag6 diag7 diag8 diag9 diag10 diag11
  786   272   401   782   250 91912   530    NA    NA     NA     NA
  845   530   338   311    NA    NA    NA    NA    NA     NA     NA

Here is the code I have tried to use:

mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451, 460:466, 480:486, 490:493, 496, 786)
y = apply(dt[,paste("diag", 1:11, sep="")], 1, function(x) sum((any(x !=NA %in% mydiag))))
y = as.data.frame(y)

As you can see, in the 2 example rows I provided, I would want to keep the first row but throw out the second row because it doesn't have any of the codes I want. The code sample I provided doesn't work: I get a vector of 39,019 "1" values. So I'm guessing the result of the apply statement is being treated as a logical somehow, and yet I know for a fact that not all of the rows have a code of interest, so I would have expected a mix of 1's and 0's.

Is there a better way to do this row selection task?

asked May 07 '12 15:05 by mEvans


2 Answers

I think you're overcomplicating things with the !=NA bit in there. Since NA doesn't appear in mydiag, you can drop it completely. So your apply statement can then become (with dat holding just the 11 diagnosis columns from your sample):

goodRows <- apply(dat, 1, function(x) any(x %in% mydiag))
dat[goodRows,]
#---------------
  diag1 diag2 diag3 diag4 diag5 diag6 diag7 diag8 diag9 diag10 diag11
1   786   272   401   782   250 91912   530    NA    NA     NA     NA

answered Nov 15 '22 08:11 by Chase


The problem comes from your function: function(x) sum((any(x !=NA %in% mydiag)))

x != NA would be better written as !is.na(x), but you must recognize that this returns a logical vector. So you'd be taking a logical vector and then checking whether its values are in mydiag. I'm guessing you just want to take the values that aren't NA and check if any of those are in mydiag.
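
To see why the original call returns 1 for every row, it may help to look at how R parses the expression. As far as I can tell from R's operator precedence rules, %in% binds more tightly than !=, so x != NA %in% mydiag is evaluated as x != (NA %in% mydiag). The sketch below rebuilds the two sample rows from the question (dat is just an illustrative name) and walks through the second row, which has no code of interest:

```r
# Two sample rows from the question (diagnosis columns only)
dat <- data.frame(
  diag1 = c(786, 845), diag2 = c(272, 530), diag3 = c(401, 338),
  diag4 = c(782, 311), diag5 = c(250, NA),  diag6 = c(91912, NA),
  diag7 = c(530, NA),  diag8 = NA_real_,    diag9 = NA_real_,
  diag10 = NA_real_,   diag11 = NA_real_
)
mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451,
            460:466, 480:486, 490:493, 496, 786)

x <- unlist(dat[2, ])        # second row: none of its codes are in mydiag

inner <- NA %in% mydiag      # FALSE, because NA is not an element of mydiag
cmp   <- x != inner          # TRUE for every non-zero code, NA for missing ones

# The original expression is the same comparison, just written without parentheses
same_parse <- identical(x != NA %in% mydiag, cmp)

# any() sees at least one TRUE, so the row is flagged even though it should not be
flag_buggy <- sum(any(x != NA %in% mydiag))   # 1
flag_fixed <- sum(any(x %in% mydiag))         # 0
```

Since every row has at least one non-zero code, the buggy version flags all of them, which matches the vector of 1's described in the question.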

x[!is.na(x)] %in% mydiag

would work much better for that. But you really don't even need to check for NAs: since NA isn't in your vector, any element of x that is NA will return FALSE for x %in% mydiag. So
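
A small check of the points above, using the first sample row from the question as a plain vector (a sketch only; x and mydiag are set up here so the snippet runs on its own):

```r
mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451,
            460:466, 480:486, 490:493, 496, 786)

# First sample row from the question
x <- c(786, 272, 401, 782, 250, 91912, 530, NA, NA, NA, NA)

all_na   <- x != NA                    # every element is NA, so this cannot detect missing values
not_na   <- !is.na(x)                  # logical vector: TRUE for the 7 recorded codes
filtered <- x[!is.na(x)] %in% mydiag   # membership test on the non-missing codes only
direct   <- x %in% mydiag              # NA entries simply come back as FALSE

any(filtered)   # TRUE
any(direct)     # TRUE, same answer without the explicit NA filter
```

Both versions flag the row because of codes 786 and 401, so dropping the NA filter does not change the result here.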

function(x){any(x %in% mydiag)}

is a nice way to get a logical value telling you whether the row meets your criteria or not.

# Logical vector: TRUE for each row with at least one code of interest
id = apply(dt[,paste("diag", 1:11, sep="")], 1, function(x){any(x %in% mydiag)})
# Just grab those rows
y <- dt[id, ]
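
If apply turns out to be slow on the full 39,019-row table, a column-wise variant is one possible alternative. This is a different technique from the apply approach above, not a correction of it: it tests each diagnosis column against mydiag and combines the results with a logical OR. The dt below is a made-up two-row stand-in, and visit_id is a hypothetical extra column added only to show that non-diagnosis columns are kept:

```r
mydiag <- c(401, 410, 411, 413, 415:417, 420:429, 434, 435, 444, 445, 451,
            460:466, 480:486, 490:493, 496, 786)

# Stand-in for the real data: the question's two sample rows plus a hypothetical id column
dt <- data.frame(
  visit_id = c("a", "b"),
  diag1 = c(786, 845), diag2 = c(272, 530), diag3 = c(401, 338),
  diag4 = c(782, 311), diag5 = c(250, NA),  diag6 = c(91912, NA),
  diag7 = c(530, NA),  diag8 = NA_real_,    diag9 = NA_real_,
  diag10 = NA_real_,   diag11 = NA_real_
)

diag_cols <- paste0("diag", 1:11)

# One logical vector per column, then OR them together element-wise (row by row)
keep <- Reduce(`|`, lapply(dt[diag_cols], `%in%`, mydiag))
y_alt <- dt[keep, ]

# Same selection as the apply-based version
id <- apply(dt[, diag_cols], 1, function(x) any(x %in% mydiag))
```

On this sample, keep and id select the same single row; whether the column-wise form is actually faster on the real data would need to be timed.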

answered Nov 15 '22 07:11 by Dason