Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R data.table replace "NULL" with `NA` when columns are factors

Tags:

r

data.table

I pulled some data from a SQL database through ODBC and the columns are automatically set to factor. It is something like the following:

library(RODBC)
library(data.table)
data <- data.table(sqlQuery(channel, query))

My data looks like this, just with a lot more columns:

data <- data.table("C1"=as.factor(c(letters[1:4], "NULL", letters[5])),
                   "C2"=as.factor(c(rnorm(3), "NULL", rnorm(2))),
                   "C3"=as.factor(c(letters[1], "NULL", letters[2:4], "NULL")))
> data
     C1                 C2   C3
1:    a -0.190200079604691    a
2:    b  0.310548914832963 NULL
3:    c 0.0153099116493453    b
4:    d               NULL    c
5: NULL  0.157187027626419    d
6:    e  0.118537540781528 NULL
> str(data)
Classes ‘data.table’ and 'data.frame':  6 obs. of  3 variables:
 $ C1: Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 6 5
 $ C2: Factor w/ 6 levels "-0.190200079604691",..: 1 5 2 6 4 3
 $ C3: Factor w/ 5 levels "a","b","c","d",..: 1 5 2 3 4 5
 - attr(*, ".internal.selfref")=<externalptr> 

How can I replace the "NULL" with NA? Here I want R to treat these SQL "NULL" strings as missing values NA. I tried the following, but seems NA causes problems.

for (col in names(data)) {
  set(data, which(data[[col]]=="NULL"), col, NA)
}

> Error in set(data, which(data[[col]] == "NULL"), col, NA) : 
  Can't assign to column 'C1' (type 'factor') a value of type 'logical' (not character, factor, integer or numeric)

RODBC Solution

Thanks to @user20650's suggestion, you can control missing values from sqlQuery by doing data <- data.table(sqlQuery(channel, query, na.strings=c("NA", "NULL"))). However, it is still possible to have this problem if your data source is formatted incorrectly, so this is not an universal solution to the post.

like image 782
Boxuan Avatar asked Jun 12 '15 19:06

Boxuan


People also ask

How do you replace null values in a column in R?

To replace NA with 0 in an R data frame, use is.na() function and then select all those values with NA and assign them to 0.

How do I replace missing values in R?

To replace missing values in R with the minimum, you can use the tidyverse package. Firstly, you use the mutate() function to specify the column in which you want to replace the missing values. Secondly, you call the replace() function to identify the NA's and to substitute them with the column lowest value.

What to replace null values with?

Null values are replaced with mean/median.


2 Answers

This has the desired effect and is much more compact:

is.na(data) <- data == "NULL"

Note re: comment Q: the is.na function is quite different than the is.na<- function. The latter one used here is an assignment of a value of NA to items that are defined by the logical expression on hte RHS of the assigment operator. There is an is.na.data.frame-method but not an is.na[<-.dataframe-method. So not really sure if this is a purely by-reference strategy, since it is not implemented with the [.data.frame syntax. It may be using "is.na<-.default".

I think after noodling around a bit more that "is.na<-.default" (which is just {x[value] <- NA; x} ) so will end up dispatching this call to [<-.data.table so it probably will be done "by reference".

like image 162
IRTFM Avatar answered Jan 04 '23 04:01

IRTFM


Here's one way:

data[,names(data):=lapply(.SD,function(x){
  z <- levels(x)
  z[z=="NULL"] <- NA
  `levels<-`(x,z)
})]

To see what happened, look at lapply(data,levels) and you'll see that "NULL" is gone.


(Thanks, @akrun:) A more concise and intuitive variant is available using the car package:

library(car)
data[,names(data):=lapply(.SD, recode, '"NULL"=NA')]

In the data.table world, it's usually possible to modify by reference. In this case, that looks like...

for (j in names(data)) setattr(data[[j]],"levels",{
  z <- levels(data[[j]])
  z[z=="NULL"] <- NA
  z
})

This avoids making a copy of the whole vector, as `levels<-` does.

like image 35
Frank Avatar answered Jan 04 '23 04:01

Frank