 

Set certain values to NA with dplyr

Tags: r, dplyr

I'm trying to figure out a simple way to do something like this with dplyr (data set = dat, variable = x):

dat$x[dat$x<0]=NA

Should be simple but this is the best I can do at the moment. Is there an easier way?

dat =  dat %>% mutate(x=ifelse(x<0,NA,x)) 
Glen asked Jan 12 '15 19:01




2 Answers

You can use replace which is a bit faster than ifelse:

dat <-  dat %>% mutate(x = replace(x, x<0, NA)) 
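To see what replace does outside of mutate, here is a minimal base-R sketch on a small made-up vector (the values are illustrative and not taken from the question). It compares the result with the ifelse version:

```r
# Toy vector; values are made up for illustration
x <- c(-2, 0.5, -1, 3)

# replace(x, list, values) returns a copy of x with x[list] set to values
via_replace <- replace(x, x < 0, NA)

# ifelse tests each element and builds a new vector from the two branches
via_ifelse <- ifelse(x < 0, NA, x)

print(via_replace)
print(all.equal(via_replace, via_ifelse))
```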

You can speed it up a bit more by supplying an index to replace using which:

dat <- dat %>% mutate(x = replace(x, which(x<0L), NA)) 
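The only difference between the two calls is the index handed to replace: a logical mask in the first case, integer positions in the second. A small sketch on a made-up integer vector (including an existing NA, which is an assumption added for illustration) shows what which returns:

```r
# Toy integer vector with an existing NA; values are made up
x <- c(-2L, NA, 4L, -7L)

mask <- x < 0L        # logical vector, NA where x is NA
idx  <- which(mask)   # integer positions of the TRUE entries; NA is dropped

print(mask)
print(idx)

# Replacing by position leaves the type of x unchanged (still integer)
res <- replace(x, idx, NA)
print(res)
```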

On my machine, this cut the time to about a third; see the timings below.

Here's a little comparison of the different answers, which is only indicative of course:

library(dplyr)
library(data.table)

set.seed(24)
dat <- data.frame(x=rnorm(1e6))

system.time(dat %>% mutate(x = replace(x, x<0, NA)))
#       User      System     elapsed
#       0.03        0.00        0.03

system.time(dat %>% mutate(x=ifelse(x<0,NA,x)))
#       User      System     elapsed
#       0.30        0.00        0.29

# note: setDT() and := modify dat by reference, so the calls after this one
# run on data that already contains NAs
system.time(setDT(dat)[x<0,x:=NA])
#       User      System     elapsed
#       0.01        0.00        0.02

system.time(dat$x[dat$x<0] <- NA)
#       User      System     elapsed
#       0.03        0.00        0.03

system.time(dat %>% mutate(x = "is.na<-"(x, x < 0)))
#       User      System     elapsed
#       0.05        0.00        0.05

system.time(dat %>% mutate(x = NA ^ (x < 0) * x))
#       User      System     elapsed
#       0.01        0.00        0.02

system.time(dat %>% mutate(x = replace(x, which(x<0), NA)))
#       User      System     elapsed
#       0.01        0.00        0.01

(I'm using dplyr_0.3.0.2 and data.table_1.9.4)


Since we're always very interested in benchmarking, especially in the course of data.table-vs-dplyr discussions, here is another benchmark of 3 of the answers, using microbenchmark and the data by akrun. Note that I modified dplr1 to be the updated version of my answer:

library(dplyr)
library(data.table)
library(microbenchmark)

set.seed(285)
dat1 <- dat <- data.frame(x=sample(-5:5, 1e8, replace=TRUE), y=rnorm(1e8))

dtbl1 <- function() {setDT(dat)[x<0,x:=NA]}
dplr1 <- function() {dat1 %>% mutate(x = replace(x, which(x<0L), NA))}
dplr2 <- function() {dat1 %>% mutate(x = NA ^ (x < 0) * x)}

microbenchmark(dtbl1(), dplr1(), dplr2(), unit='relative', times=20L)
#Unit: relative
#    expr      min       lq   median       uq      max neval
# dtbl1() 1.091208 4.319863 4.194086 4.162326 4.252482    20
# dplr1() 1.000000 1.000000 1.000000 1.000000 1.000000    20
# dplr2() 6.251354 5.529948 5.344294 5.311595 5.190192    20
talat answered Sep 20 '22 08:09


You can use the is.na<- function:

dat %>% mutate(x = "is.na<-"(x, x < 0)) 
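The quoted "is.na<-" is the replacement function behind the more familiar `is.na(x) <- ...` syntax. A minimal base-R sketch on a made-up vector shows that the two forms give the same result:

```r
# Toy vector; values are made up for illustration
x <- c(-2, 0.5, -1, 3)

# Usual replacement-function syntax, applied to a copy
y <- x
is.na(y) <- y < 0

# The same function called directly by name, as in the mutate() call above
z <- `is.na<-`(x, x < 0)

print(y)
print(identical(y, z))
```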

Or you can use mathematical operators:

dat %>% mutate(x = NA ^ (x < 0) * x) 
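The arithmetic version relies on R evaluating NA^0 as 1 and NA^1 as NA, with the logical condition coerced to 0/1. A small sketch on made-up vectors breaks the expression into steps; one side effect worth noting is that the result is always double, even for integer input:

```r
# Toy vector; values are made up for illustration
x <- c(-2, 0.5, -1, 3)

flag <- x < 0       # TRUE where a value should become NA
mult <- NA ^ flag   # NA where flag is TRUE, 1 where flag is FALSE
print(mult)

res <- mult * x     # NA * value is NA, 1 * value keeps the value
print(res)

# With integer input the result is coerced to double
xi <- c(-2L, 5L)
print(class(NA ^ (xi < 0L) * xi))
```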
Sven Hohenstein answered Sep 18 '22 08:09