The data frame has a variable called YOB. As you can see, it contains 333 NA values.
> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   1880    1970    1983    1980    1993    2039     333
I identified some outliers and want to get rid of them. Any value less than 1900 or greater than 2003 should be removed. I tried to do this by indexing.
train = train[which(train$YOB >= 1900 & train$YOB <= 2003),]
Unfortunately, the observations whose YOB value was NA are also removed.
> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1900    1970    1983    1980    1993    2003
On a side note, I face the same problem when using the subset command.
> train = subset(train, YOB >= 1900 & YOB <= 2003)
> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1900    1970    1983    1980    1993    2003
I have also tried adding an is.na condition to both attempts, but with no success, e.g.
> train = train[which(!is.na(train$YOB) & train$YOB >= 1900 & train$YOB <= 2003),]
> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1900    1970    1983    1980    1993    2003
I would like to keep the observations that have NA in the YOB variable and only remove the numeric values that fall outside the range. The idea is to impute the missing values in a second step.
which returns only the numeric positions of the TRUE elements, so every row where the condition evaluates to NA is skipped. If you use the logical index directly, without wrapping it in which, the index is NA for those rows. Each such row is then returned, but as a row of all NA, even in columns whose original values were non-NA.
res1 <- train[train$YOB >= 1900 & train$YOB <= 2003,]
res1[is.na(res1$YOB),]
# YOB col2
#NA NA NA
The correct way is to add another condition with is.na, combined with | (OR) rather than & (AND):
res2 <- train[is.na(train$YOB)| (train$YOB >= 1900 & train$YOB <= 2003),]
res2[is.na(res2$YOB),]
# YOB col2
#42 NA 0.2258094
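Since the question also mentions subset, the same extra condition should carry over to it. subset treats an NA condition as FALSE, so the is.na test is what lets those rows through. A minimal sketch on a made-up data frame (toy and its values are hypothetical, not the OP's data):

```r
# Hypothetical toy data: one NA, two out-of-range years, two valid years
toy <- data.frame(YOB  = c(NA, 1880, 1975, 2039, 1990),
                  col2 = c(0.1, 0.2, 0.3, 0.4, 0.5))

# Without is.na(), subset() drops the NA row along with the outliers
dropped <- subset(toy, YOB >= 1900 & YOB <= 2003)

# With is.na(), the NA row is kept and only the outliers are removed
kept <- subset(toy, is.na(YOB) | (YOB >= 1900 & YOB <= 2003))

nrow(dropped)  # rows 3 and 5 only
kept           # rows 1, 3 and 5 remain
```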
Using a simple example
set.seed(25)
d1 <- data.frame(v1 = c(NA, 1, 5), v2 = rnorm(3))
d1$v1 >1
#[1] NA FALSE TRUE
Here, the comparison with NA stays NA. If we wrap it in which,
which(d1$v1 >1)
#[1] 3
we get only the index of the TRUE values. The OP wants both the NA rows and the rows that satisfy the logical condition to be returned. In that case,
d1[is.na(d1$v1)|d1$v1 > 1,]
# v1 v2
#1 NA -0.2118336
#3 5 -1.1533076
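To see the three indexing styles side by side, here is a small sketch. It uses a hypothetical data frame toy that mirrors d1 but has fixed v2 values, so the result does not depend on the random seed:

```r
# Hypothetical toy data mirroring d1, with fixed v2 values
toy <- data.frame(v1 = c(NA, 1, 5), v2 = c(10, 20, 30))

# which(): the NA row is dropped entirely
by_which <- toy[which(toy$v1 > 1), ]

# Plain logical index: the NA row comes back, but every column is NA
by_logical <- toy[toy$v1 > 1, ]

# is.na() | condition: the NA row is kept together with its v2 value
by_is_na <- toy[is.na(toy$v1) | toy$v1 > 1, ]

nrow(by_which)   # 1
by_logical$v2    # NA 30
by_is_na$v2      # 10 30
```

Only the last form keeps the other columns of the NA rows intact, which matters if those rows are to be imputed later.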
Sample data used for the train examples in this answer:
set.seed(29)
train <- data.frame(YOB = sample(c(NA, 1850:2015), 100, replace = TRUE),
                    col2 = rnorm(100))
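As a quick sanity check, one could confirm that the filter keeps every NA and leaves no out-of-range year. This is only a sketch on the sample data above; the exact rows drawn depend on the R version's sampling algorithm, so no output is shown:

```r
set.seed(29)
train <- data.frame(YOB = sample(c(NA, 1850:2015), 100, replace = TRUE),
                    col2 = rnorm(100))

# Keep NA rows plus rows whose YOB lies in [1900, 2003]
res2 <- train[is.na(train$YOB) | (train$YOB >= 1900 & train$YOB <= 2003), ]

# The number of NA values should be unchanged by the filter
sum(is.na(train$YOB)) == sum(is.na(res2$YOB))

# The remaining non-NA years should all lie within the bounds
all(res2$YOB >= 1900 & res2$YOB <= 2003, na.rm = TRUE)
```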