My dataframe has some variables that contain missing values as strings like "NA"
. What is the most efficient way to parse all columns in a dataframe that contain these and convert them into real NAs that are catched by functions like is.na()
?
I am using sqldf to query the database.
Reproducible example:
vect1 <- c("NA", "NA", "BANANA", "HELLO")
vect2 <- c("NA", 1, 5, "NA")
vect3 <- c(NA, NA, "NA", "NA")
df = data.frame(vect1,vect2,vect3)
To add to the alternatives, you can also use replace
instead of the typical blah[index] <- NA
approach. replace
would look like:
df <- replace(df, df == "NA", NA)
Another alternative to consider is type.convert
. This is the function that R uses when reading data in to automatically convert column types. Thus, the result is different from your current approach in that, for instance, the second column gets converted to numeric.
df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
df
Here's a performance comparison. The sample data is from @roland's answer.
Here are the functions to test:
funop <- function() {
df[df == "NA"] <- NA
df
}
funr <- function() {
ind <- which(vapply(df, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
as.data.table(df)[, names(df)[ind] := lapply(.SD, function(x) {
is.na(x) <- x == "NA"
x
}), .SDcols = ind][]
}
funam1 <- function() replace(df, df == "NA", NA)
funam2 <- function() {
df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
df
}
Here's the benchmarking:
library(microbenchmark)
microbenchmark(funop(), funr(), funam1(), funam2(), times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# funop() 3.629832 3.750853 3.909333 3.855636 4.098086 4.248287 10
# funr() 3.074825 3.212499 3.320430 3.279268 3.332304 3.685837 10
# funam1() 3.714561 3.899456 4.238785 4.065496 4.280626 5.512706 10
# funam2() 1.391315 1.455366 1.623267 1.566486 1.606694 2.253258 10
replace
would be the same as @roland's approach, which is the same as @jgozal's. However, the type.convert
approach would result in different column types.
all.equal(funop(), setDF(funr()))
all.equal(funop(), funam())
str(funop())
# 'data.frame': 10000000 obs. of 3 variables:
# $ vect1: Factor w/ 3 levels "BANANA","HELLO",..: 2 2 NA 2 1 1 1 NA 1 1 ...
# $ vect2: Factor w/ 3 levels "1","5","NA": NA 2 1 NA 1 NA NA 1 NA 2 ...
# $ vect3: Factor w/ 1 level "NA": NA NA NA NA NA NA NA NA NA NA ...
str(funam2())
# 'data.frame': 10000000 obs. of 3 variables:
# $ vect1: Factor w/ 2 levels "BANANA","HELLO": 2 2 NA 2 1 1 1 NA 1 1 ...
# $ vect2: int NA 5 1 NA 1 NA NA 1 NA 5 ...
# $ vect3: logi NA NA NA NA NA NA ...
I found this nice way of doing it from this question:
So for this particular situation it would just be:
df[df=="NA"]<-NA
It only took about 30 seconds with 5 million rows and ~250 variables
This is slightly faster:
set.seed(42)
df <- do.call(data.frame, lapply(df, sample, size = 1e7, replace = TRUE))
df2 <- df
system.time(df[df=="NA"]<-NA )
# user system elapsed
#3.601 0.378 3.984
library(data.table)
setDT(df2)
system.time({
#find character and factor columns
ind <- which(vapply(df2, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
#assign by reference
df2[, names(df2)[ind] := lapply(.SD, function(x) {
is.na(x) <- x == "NA"
x
}), .SDcols = ind]
})
# user system elapsed
#2.484 0.190 2.676
all.equal(df, setDF(df2))
#[1] TRUE
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With