Key issue: using setattr to change level names keeps unwanted duplicates.
I am cleaning some data in which several factor levels that should be identical appear as two or more distinct levels. (This error is due mostly to typos and file-encoding issues.) I have 153K factors, and about 5% need to be corrected.
Example
In the following example, the vector has three levels, two of which need to be collapsed into one.
incorrect <- factor(c("AOB", "QTX", "A_B")) # this is how the data were entered
correct <- factor(c("AOB", "QTX", "AOB")) # this is how the data *should* be
> incorrect
[1] AOB QTX A_B
Levels: A_B AOB QTX <~~ Note that "A_B" should be "AOB"
> correct
[1] AOB QTX AOB
Levels: AOB QTX
The vector is part of a data.table.
Everything works fine when using the levels<- function to change the level names. However, when using setattr, the unwanted duplicates are preserved.
library(data.table)
mydt1 <- data.table(id=1:3, incorrect, key="id")
mydt2 <- data.table(id=1:3, incorrect, key="id")
# assigning levels, duplicate levels are dropped
levels(mydt1$incorrect) <- gsub("_", "O", levels(mydt1$incorrect))
# using setattr, duplicate levels are not dropped
setattr(mydt2$incorrect, "levels", gsub("_", "O", levels(mydt2$incorrect)))
# RESULTS
# Assigning levels          # Using `setattr`
> mydt1$incorrect           > mydt2$incorrect
[1] AOB QTX AOB             [1] AOB QTX AOB
Levels: AOB QTX             Levels: AOB AOB QTX   <~~~ Notice the duplicate level
Any thoughts on why this is, and/or any options to change this behavior (e.g. ..., droplevels=TRUE)?
Thanks
setattr is a low-level, brute-force way to change attributes by reference. It doesn't know that the "levels" attribute is special. levels<- has more functionality inside it, but I suspect you may have found that levels(DT$col)<-newlevels will copy the whole of DT (base <-), hence for speed you looked to setattr.
I wouldn't say incorrect btw. It's a correct and valid factor, but just happens to have duplicate levels.
To drop the duplicate levels, I think (untested):
mydt1[, factorCol := factor(factorCol)]
should do it. It's possible to go faster than that by finding which levels you've changed, changing the integers to point to the first one of the duplicates, and then removing the dups from the levels. The call to factor() basically starts from scratch (i.e. coerces all of the factor to character and rematches).
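The faster route described above could look roughly like the following. This is only a sketch, not something built into data.table: dedupe_levels is a hypothetical helper name, and it assumes the column is a factor whose levels attribute already contains the duplicates (for example after the setattr call from the question). It remaps the integer codes to the first occurrence of each level and never coerces the values to character.

```r
library(data.table)

# Hypothetical helper (not part of data.table): collapse duplicate levels
# of a factor column in place, without coercing the values to character.
dedupe_levels <- function(DT, col) {
  f     <- DT[[col]]
  lev   <- levels(f)
  ulev  <- unique(lev)                # levels with duplicates removed
  map   <- match(lev, ulev)           # old level index -> index of first occurrence
  codes <- map[as.integer(f)]         # remap the integer codes; NA stays NA
  setattr(codes, "levels", ulev)
  setattr(codes, "class", class(f))   # keeps "factor" (or c("ordered", "factor"))
  set(DT, j = col, value = codes)     # replace the column by reference
  invisible(DT)
}

# Usage on the example from the question
dt <- data.table(id = 1:3, incorrect = factor(c("AOB", "QTX", "A_B")), key = "id")
setattr(dt$incorrect, "levels", gsub("_", "O", levels(dt$incorrect)))  # leaves duplicate levels
dedupe_levels(dt, "incorrect")
```

The work on the values is a single integer lookup per row; only the (short) levels vector is compared as strings, which is where the saving over factor() would come from.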