I pulled some data from a SQL database through ODBC, and the columns are automatically set to factor. It is something like the following:
library(RODBC)
library(data.table)
data <- data.table(sqlQuery(channel, query))
My data looks like this, just with a lot more columns:
data <- data.table("C1"=as.factor(c(letters[1:4], "NULL", letters[5])),
"C2"=as.factor(c(rnorm(3), "NULL", rnorm(2))),
"C3"=as.factor(c(letters[1], "NULL", letters[2:4], "NULL")))
> data
C1 C2 C3
1: a -0.190200079604691 a
2: b 0.310548914832963 NULL
3: c 0.0153099116493453 b
4: d NULL c
5: NULL 0.157187027626419 d
6: e 0.118537540781528 NULL
> str(data)
Classes ‘data.table’ and 'data.frame': 6 obs. of 3 variables:
$ C1: Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 6 5
$ C2: Factor w/ 6 levels "-0.190200079604691",..: 1 5 2 6 4 3
$ C3: Factor w/ 5 levels "a","b","c","d",..: 1 5 2 3 4 5
- attr(*, ".internal.selfref")=<externalptr>
How can I replace the "NULL" with NA? I want R to treat these SQL "NULL" strings as missing values (NA). I tried the following, but it seems NA causes problems:
for (col in names(data)) {
set(data, which(data[[col]]=="NULL"), col, NA)
}
> Error in set(data, which(data[[col]] == "NULL"), col, NA) :
Can't assign to column 'C1' (type 'factor') a value of type 'logical' (not character, factor, integer or numeric)
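The error occurs because a plain NA is logical, which set() refuses to assign into a factor column. A minimal sketch of a type-stable fix, using NA_character_ (a character NA, which satisfies set()'s type check) on sample data shaped like the above:

```r
library(data.table)

data <- data.table(C1 = factor(c("a", "NULL", "b")),
                   C2 = factor(c("NULL", "x", "y")))

for (col in names(data)) {
  # NA_character_ is character-typed, so set() accepts it for a factor column
  set(data, which(data[[col]] == "NULL"), col, NA_character_)
}

data  # the "NULL" entries now print as <NA>
```

Note this only blanks the offending entries; the "NULL" string remains among the factor levels unless you also call droplevels().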
RODBC Solution
Thanks to @user20650's suggestion, you can control missing values from sqlQuery
by doing data <- data.table(sqlQuery(channel, query, na.strings=c("NA", "NULL")))
. However, it is still possible to have this problem if your data source is formatted incorrectly, so this is not a universal solution to the post.
This has the desired effect and is much more compact:
is.na(data) <- data == "NULL"
Note re: comment Q: the is.na function is quite different from the is.na<- function. The latter, used here, assigns NA to the items identified by the logical expression on the RHS of the assignment operator. There is an is.na.data.frame method but no is.na<-.data.frame method, so I'm not really sure this is a purely by-reference strategy, since it is not implemented with the [.data.frame syntax. It may be using is.na<-.default.
After noodling around a bit more, I think is.na<-.default (which is just {x[value] <- NA; x}) will end up dispatching this call to [<-.data.table, so it probably will be done "by reference".
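As a small illustration of the is.na<-.default semantics on a plain vector (hypothetical sample values):

```r
x <- c("a", "NULL", "b")

# is.na(x) <- <logical index> is equivalent to x[<logical index>] <- NA
is.na(x) <- x == "NULL"

x
# [1] "a" NA  "b"
```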
Here's one way:
data[,names(data):=lapply(.SD,function(x){
z <- levels(x)
z[z=="NULL"] <- NA
`levels<-`(x,z)
})]
To see what happened, look at lapply(data,levels)
and you'll see that "NULL"
is gone.
(Thanks, @akrun:) A more concise and intuitive variant is available using the car
package:
library(car)
data[,names(data):=lapply(.SD, recode, '"NULL"=NA')]
In the data.table
world, it's usually possible to modify by reference. In this case, that looks like...
for (j in names(data)) setattr(data[[j]],"levels",{
z <- levels(data[[j]])
z[z=="NULL"] <- NA
z
})
This avoids making a copy of the whole vector, as `levels<-`
does.