I pulled some data from a SQL database through ODBC, and the columns are automatically set to factor. It is something like the following:
library(RODBC)
library(data.table)
data <- data.table(sqlQuery(channel, query))
My data looks like this, just with a lot more columns:
data <- data.table("C1"=as.factor(c(letters[1:4], "NULL", letters[5])),
"C2"=as.factor(c(rnorm(3), "NULL", rnorm(2))),
"C3"=as.factor(c(letters[1], "NULL", letters[2:4], "NULL")))
> data
C1 C2 C3
1: a -0.190200079604691 a
2: b 0.310548914832963 NULL
3: c 0.0153099116493453 b
4: d NULL c
5: NULL 0.157187027626419 d
6: e 0.118537540781528 NULL
> str(data)
Classes ‘data.table’ and 'data.frame': 6 obs. of 3 variables:
$ C1: Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 6 5
$ C2: Factor w/ 6 levels "-0.190200079604691",..: 1 5 2 6 4 3
$ C3: Factor w/ 5 levels "a","b","c","d",..: 1 5 2 3 4 5
- attr(*, ".internal.selfref")=<externalptr>
How can I replace the "NULL" with NA? I want R to treat these SQL "NULL" strings as missing values (NA). I tried the following, but it seems NA causes problems:
for (col in names(data)) {
set(data, which(data[[col]]=="NULL"), col, NA)
}
> Error in set(data, which(data[[col]] == "NULL"), col, NA) :
Can't assign to column 'C1' (type 'factor') a value of type 'logical' (not character, factor, integer or numeric)
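The error occurs because a plain NA is logical, which set() refuses to assign into a factor column. A minimal sketch of a type-stable fix, using NA_character_ (a character NA, which satisfies set()'s type check) on sample data shaped like the above:

```r
library(data.table)

data <- data.table(C1 = factor(c("a", "NULL", "b")),
                   C2 = factor(c("NULL", "x", "y")))

for (col in names(data)) {
  # NA_character_ is character-typed, so set() accepts it for a factor column
  set(data, which(data[[col]] == "NULL"), col, NA_character_)
}

data  # the "NULL" entries now print as <NA>
```

Note this only blanks the offending entries; the "NULL" string remains among the factor levels unless you also call droplevels().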
RODBC Solution
Thanks to @user20650's suggestion, you can control missing values from sqlQuery
by doing data <- data.table(sqlQuery(channel, query, na.strings=c("NA", "NULL")))
. However, it is still possible to have this problem if your data source is formatted incorrectly, so this is not a universal solution to the post.
This has the desired effect and is much more compact:
is.na(data) <- data == "NULL"
Note re: comment Q: the is.na function is quite different from the is.na<- function. The latter, used here, assigns NA to the items identified by the logical expression on the RHS of the assignment operator. There is an is.na.data.frame method but no is.na<-.data.frame method, so I'm not really sure this is a purely by-reference strategy, since it is not implemented with the [.data.frame syntax. It may be using is.na<-.default.
After noodling around a bit more, I think is.na<-.default (which is just {x[value] <- NA; x}) will end up dispatching this call to [<-.data.table, so it probably will be done "by reference".
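As a small illustration of the is.na<-.default semantics on a plain vector (hypothetical sample values):

```r
x <- c("a", "NULL", "b")

# is.na(x) <- <logical index> is equivalent to x[<logical index>] <- NA
is.na(x) <- x == "NULL"

x
# [1] "a" NA  "b"
```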
Here's one way:
data[,names(data):=lapply(.SD,function(x){
z <- levels(x)
z[z=="NULL"] <- NA
`levels<-`(x,z)
})]
To see what happened, look at lapply(data,levels)
and you'll see that "NULL"
is gone.
(Thanks, @akrun:) A more concise and intuitive variant is available using the car
package:
library(car)
data[,names(data):=lapply(.SD, recode, '"NULL"=NA')]
In the data.table
world, it's usually possible to modify by reference. In this case, that looks like...
for (j in names(data)) setattr(data[[j]],"levels",{
z <- levels(data[[j]])
z[z=="NULL"] <- NA
z
})
This avoids making a copy of the whole vector, as `levels<-`
does.