Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to subset data in R without losing NA rows?

I have some data that I am looking at in R. One particular column, titled "Height", contains a few rows of NA.

I am looking to subset my data-frame so that all Heights above a certain value are excluded from my analysis.

df2 <- subset ( df1 , Height < 40 )

However whenever I do this, R automatically removes all rows that contain NA values for Height. I do not want this. I have tried including arguments for na.rm

f1 <- function ( x , na.rm = FALSE ) {
df2 <- subset ( x , Height < 40 )
}
f1 ( df1 , na.rm = FALSE )

but this does not seem to do anything; the rows with NA still end up disappearing from my data-frame. Is there a way of subsetting my data as such, without losing the NA rows?

like image 468
Ryan Rothman Avatar asked Nov 06 '16 05:11

Ryan Rothman


People also ask

How do you subset data without NA in R?

To select rows of an R data frame that are non-Na, we can use complete. cases function with single square brackets. For example, if we have a data frame called that contains some missing values (NA) then the selection of rows that are non-NA can be done by using the command df[complete. cases(df),].

How do I subset certain rows in R?

By using bracket notation on R DataFrame (data.name) we can select rows by column value, by index, by name, by condition e.t.c. You can also use the R base function subset() to get the same results. Besides these, R also provides another function dplyr::filter() to get the rows from the DataFrame.

How do I ignore rows with NA in R?

To remove all rows having NA, we can use na. omit function. For Example, if we have a data frame called df that contains some NA values then we can remove all rows that contains at least one NA by using the command na. omit(df).

How do I replace Na with missing in R?

You can replace NA values with blank space on columns of R dataframe (data. frame) by using is.na() , replace() methods.


1 Answers

If we decide to use subset function, then we need to watch out:

For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.

So only non-NA values will be retained.

If you want to keep NA cases, use logical or condition to tell R not to drop NA cases:

subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`

Don't use directly (to be explained soon):

df2 <- df1[df1$Height < 40, ]

Example

df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)

subset(df1, Height < 40 | is.na(Height))

#  Height y
#1     NA 1
#2      2 2
#3      4 3
#4     NA 4

df1[df1$Height < 40, ]

#  Height  y
#1     NA NA
#2      2  2
#3      4  3
#4     NA NA

The reason that the latter fails, is that indexing by NA gives NA. Consider this simple example with a vector:

x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA  2 NA

We need to somehow replace those NA with TRUE. The most straightforward way is to add another "or" condition is.na(ind):

x[ind | is.na(ind)]
# [1] 1 2 3

This is exactly what will happen in your situation. If your Height contains NA, then logical operation Height < 40 ends up a mix of TRUE / FALSE / NA, so we need replace NA by TRUE as above.

like image 64
Zheyuan Li Avatar answered Sep 28 '22 18:09

Zheyuan Li