
How can I remove observations from a data frame conditionally without losing NA values in R?

In the data frame there is a variable called YOB. As you can see, there are 333 NA values.

> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1880    1970    1983    1980    1993    2039     333 

I identified some outliers and want to get rid of them. Any value less than 1900 or greater than 2003 should be removed. I tried to do this by indexing.

train = train[which(train$YOB >= 1900 & train$YOB <= 2003),]

Unfortunately, observations whose YOB value was NA are also removed.

> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1900    1970    1983    1980    1993    2003 

On a side note, I face the same problem when using the subset command.

> train = subset(train, YOB >= 1900 & YOB <= 2003)
> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1900    1970    1983    1980    1993    2003 

I have also tried adding an is.na condition to both attempts, but with no success, e.g.

> train = train[which(!is.na(train$YOB) & train$YOB >= 1900 & train$YOB <= 2003),]
> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1900    1970    1983    1980    1993    2003 

I would like to keep the observations that have NA in the YOB variable and remove only those whose numeric value is out of range. The idea is to impute the missing values in a second step.

Ely asked Oct 18 '22 08:10


1 Answer

which returns the numeric positions of the TRUE elements only, so every row where the condition evaluates to NA is skipped. Indexing with the logical vector directly, without wrapping it in which, does not solve the problem either: wherever the index is NA, R returns a row filled entirely with NA, even if the other columns of the original row hold non-NA values.

res1 <- train[train$YOB >= 1900 & train$YOB <= 2003,]
#   YOB col2
#NA  NA   NA

The correct way is to add another condition with is.na, so that rows with a missing YOB are kept explicitly:

res2 <- train[is.na(train$YOB) | (train$YOB >= 1900 & train$YOB <= 2003), ]
#   YOB      col2
#42  NA 0.2258094
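
As a quick check of the is.na approach, here is a minimal sketch on a small, fixed data frame. The names toy, keep, filtered and filtered2 and the values are hypothetical and chosen only for illustration; they are not from the original data.

```r
# Hypothetical toy data: two out-of-range years, two missing, three valid
toy <- data.frame(YOB = c(1880, 1950, NA, 1999, 2039, NA, 2003),
                  id  = 1:7)

# Keep rows where YOB is missing OR lies within [1900, 2003]
keep <- is.na(toy$YOB) | (toy$YOB >= 1900 & toy$YOB <= 2003)
filtered <- toy[keep, ]

# subset() treats an NA condition as FALSE, so the same is.na term is needed
filtered2 <- subset(toy, is.na(YOB) | (YOB >= 1900 & YOB <= 2003))

sum(is.na(toy$YOB))                # missing values before filtering
sum(is.na(filtered$YOB))           # missing values after filtering
range(filtered$YOB, na.rm = TRUE)  # remaining years stay within the bounds
```

The number of NA values should be the same before and after filtering, which is what makes a later imputation step possible.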

Using a simple example:

d1 <- data.frame(v1 = c(NA, 1, 5), v2 = rnorm(3))
d1$v1 >1
#[1]    NA FALSE  TRUE

Here, the NA value remains as such. If we use which

which(d1$v1 >1)
#[1] 3

we get only the index of the TRUE values. The OP wants both the NA rows and the rows that satisfy the logical condition to be returned. In that case,

d1[is.na(d1$v1)|d1$v1 > 1,]
# v1         v2
#1 NA -0.2118336
#3  5 -1.1533076
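
The reason the combined condition works comes from R's three-valued logic: TRUE | NA is TRUE, so the is.na term turns every NA in the comparison into a definite TRUE. A minimal sketch (the vector x and the name keep are used here only for illustration):

```r
TRUE | NA    # TRUE: the result is TRUE whatever the missing value would be
FALSE | NA   # NA: the result depends on the missing value

x <- c(NA, 1, 5)
x > 1                     # contains an NA for the missing element
keep <- is.na(x) | x > 1  # the NA is replaced by TRUE
keep
anyNA(keep)               # FALSE, so indexing yields no all-NA rows
```

Because keep contains no NA, it can be used directly as a row index without producing the all-NA rows seen in res1.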


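Data used for res1 and res2 above (generated randomly without a seed, so the exact rows will differ between runs):

train <- data.frame(YOB = sample(c(NA, 1850:2015), 100, replace = TRUE),
                    col2 = rnorm(100))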
akrun answered Oct 21 '22 02:10
