`setattr` on `levels` preserving unwanted duplicates (R data.table)

Key issue: using setattr to change level names keeps unwanted duplicate levels.

I am cleaning some data in which several factor levels that should be identical appear as two or more distinct levels. (The errors are due mostly to typos and file-encoding issues.) I have 153K factors, and about 5% need to be corrected.

Example

In the following example, the vector has three levels, two of which need to be collapsed into one.

  incorrect <- factor(c("AOB", "QTX", "A_B"))   # this is how the data were entered
  correct   <- factor(c("AOB", "QTX", "AOB"))   # this is how the data *should* be

  > incorrect
  [1] AOB QTX A_B
  Levels: A_B AOB QTX   <~~ Note that "A_B" should be "AOB"

  > correct
  [1] AOB QTX AOB
  Levels: AOB QTX

The vector is part of a data.table.
Everything works fine when using the levels<- function to change the level names.
However, if using setattr, then unwanted duplicates are preserved.

library(data.table)

mydt1 <- data.table(id=1:3, incorrect, key="id")
mydt2 <- data.table(id=1:3, incorrect, key="id")

# assigning levels, duplicate levels are dropped
levels(mydt1$incorrect) <- gsub("_", "O", levels(mydt1$incorrect))

# using setattr, duplicate levels are not dropped
setattr(mydt2$incorrect, "levels", gsub("_", "O", levels(mydt2$incorrect)))

                # RESULTS
# Assigning Levels       # Using `setattr`
> mydt1$incorrect        >     mydt2$incorrect
[1] AOB QTX AOB          [1] AOB QTX AOB
Levels: AOB QTX          Levels: AOB AOB QTX   <~~~ Notice the duplicate level

Any thoughts on why this happens, and/or any options to change this behavior (e.g. ..., droplevels=TRUE)? Thanks

Ricardo Saporta asked Feb 07 '13 17:02


1 Answer

setattr is a low-level, brute-force way to change attributes by reference. It doesn't know that the "levels" attribute is special. levels<- has more functionality inside it, but I suspect you may have found that levels(DT$col)<-newlevels will copy the whole of DT (base <-), hence for speed you looked to setattr.

I wouldn't say incorrect btw. It's a correct and valid factor, but just happens to have duplicate levels.

To drop the duplicate levels, I think (untested) :

mydt2[, incorrect := factor(incorrect)]

should do it. It's possible to go faster than that by finding which levels you've changed, changing the integers to point to the first one of duplicates and then remove the dups from the levels. The call to factor() basically starts from scratch (i.e. coerces all of the factor to character and rematches).

Matt Dowle answered Sep 21 '22 01:09