What is the most effective (i.e. efficient / appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how can two or more factor levels be combined into one?
Here's an example where the two levels "Yes" and "Y" should be collapsed to "Yes", and "No" and "N" collapsed to "No":
    ## Given:
    x <- c("Y", "Y", "Yes", "N", "No", "H")  # The 'H' should be treated as NA

    ## expectedOutput
    [1] Yes Yes Yes No No <NA>
    Levels: Yes No  # <~~ NOTICE ONLY **TWO** LEVELS
One option is of course to clean the strings beforehand using `sub` and friends.
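A minimal sketch of that string-cleaning route, assuming the short forms are exactly "Y" and "N" and that anything else should become NA (the name `cleaned` is just for illustration):

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Normalise the short forms; the anchors ^ and $ make sure only
# whole-string matches are replaced, so "Yes" is left untouched.
cleaned <- sub("^Y$", "Yes", x)
cleaned <- sub("^N$", "No", cleaned)

# Anything that is still not "Yes" or "No" is treated as missing.
cleaned[!cleaned %in% c("Yes", "No")] <- NA

# Build the factor with the levels in the desired order.
result <- factor(cleaned, levels = c("Yes", "No"))
result
```

This works, but every variant spelling needs its own substitution, which is why a level-based approach is usually more convenient.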
Another method is to allow duplicate labels, then drop them:
    ## Duplicate levels ==> "Warning: deprecated"
    x.f <- factor(x, levels=c("Y", "Yes", "No", "N"), labels=c("Yes", "Yes", "No", "No"))

    ## the above line can be wrapped in either of the next two lines
    factor(x.f)
    droplevels(x.f)
However, is there a more effective way?
While I know that the `levels` and `labels` arguments should be vectors, I experimented with lists, named lists and named vectors to see what happens. Needless to say, none of the following got me any closer to my goal.
    factor(x, levels=list(c("Yes", "Y"), c("No", "N")), labels=c("Yes", "No"))
    factor(x, levels=c("Yes", "No"), labels=list(c("Yes", "Y"), c("No", "N")))
    factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))
    factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Yes="Y", Yes="Yes", No="No", No="N"))
    factor(x, levels=c("Yes", "No"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))
UPDATE 2: See Uwe's answer which shows the new "tidyverse" way of doing this, which is quickly becoming the standard.
UPDATE 1: Duplicated labels (but not levels!) are now indeed allowed (per my comment above); see Tim's answer.
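A minimal sketch of the duplicated-labels form that UPDATE 1 refers to, assuming an R version recent enough to accept duplicated labels (3.4.0 or later, as far as I know); the name `x.f` is reused from the question:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Each level is listed once; the labels may repeat, and levels that
# share a label are merged. "H" is not among the levels, so it becomes NA.
x.f <- factor(x,
              levels = c("Y", "Yes", "N", "No"),
              labels = c("Yes", "Yes", "No", "No"))
x.f
```

On such a version no extra `factor()` or `droplevels()` call should be needed, because the merged factor only carries the unique labels as levels.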
ORIGINAL ANSWER, BUT STILL USEFUL AND OF INTEREST: There is a little-known option to pass a named list to the `levels` function, for exactly this purpose. The names of the list should be the desired names of the levels, and the elements should be the current names that are to be renamed. Some (including the OP, see Ricardo's comment to Tim's answer) prefer this for ease of reading.
    x <- c("Y", "Y", "Yes", "N", "No", "H", NA)
    x <- factor(x)
    levels(x) <- list("Yes"=c("Y", "Yes"), "No"=c("N", "No"))
    x
    ## [1] Yes Yes Yes No No <NA> <NA>
    ## Levels: Yes No
This is mentioned in the `levels` documentation; also see the examples there.
value: For the 'factor' method, a vector of character strings with length at least the number of levels of 'x', or a named list specifying how to rename the levels.
This can also be done in one line, as Marek does here: https://stackoverflow.com/a/10432263/210673; the `levels<-` sorcery is explained here: https://stackoverflow.com/a/10491881/210673.
    > `levels<-`(factor(x), list(Yes=c("Y", "Yes"), No=c("N", "No")))
    [1] Yes Yes Yes No No <NA>
    Levels: Yes No
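If the same mapping has to be applied to several vectors, the named-list assignment can be wrapped in a small helper. This is only a sketch; `collapse_levels` and its argument names are hypothetical and not part of base R:

```r
# Hypothetical helper: 'mapping' is a named list whose names are the new
# levels and whose elements are the old values to be merged into them.
collapse_levels <- function(x, mapping) {
  f <- factor(x)
  # Values not mentioned anywhere in 'mapping' end up as NA.
  levels(f) <- mapping
  f
}

x <- c("Y", "Y", "Yes", "N", "No", "H")
collapsed <- collapse_levels(x, list(Yes = c("Y", "Yes"), No = c("N", "No")))
collapsed
```

The resulting levels follow the order of the names in the list, so putting `Yes` first gives the level order asked for in the question.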
As the question is titled Cleaning up factor levels (collapsing multiple levels/labels), the `forcats` package should be mentioned here as well, for the sake of completeness. `forcats` appeared on CRAN in August 2016.
There are several convenience functions available for cleaning up factor levels:
    x <- c("Y", "Y", "Yes", "N", "No", "H")
    library(forcats)

    fct_collapse(x, Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
    #[1] Yes Yes Yes No No <NA>
    #Levels: No Yes

    fct_recode(x, Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")
    #[1] Yes Yes Yes No No <NA>
    #Levels: No Yes

    fun <- function(z) {
      z[z == "Y"] <- "Yes"
      z[z == "N"] <- "No"
      z[!(z %in% c("Yes", "No"))] <- NA
      z
    }
    fct_relabel(factor(x), fun)
    #[1] Yes Yes Yes No No <NA>
    #Levels: No Yes
Note that `fct_relabel()` works with factor levels, so it expects a factor as its first argument. The two other functions, `fct_collapse()` and `fct_recode()`, also accept a character vector, which is an undocumented feature.
The expected output given by the OP is
    [1] Yes Yes Yes No No <NA>
    Levels: Yes No
Here the levels are ordered as they appear in `x`, which is different from the default (`?factor`: The levels of a factor are by default sorted). To be in line with the expected output, use `fct_inorder()` before collapsing the levels:
    fct_collapse(fct_inorder(x), Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
    fct_recode(fct_inorder(x), Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")
Both now return the expected output, with the levels in the same order.
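`fct_inorder()` relies on "Y" or "Yes" happening to appear in `x` before "N" or "No". If the level order should not depend on the data, one alternative is to state it explicitly after collapsing. A minimal sketch using `fct_relevel()` from the same package, passing a factor explicitly rather than a character vector (the name `collapsed` is just for illustration):

```r
library(forcats)

x <- c("Y", "Y", "Yes", "N", "No", "H")

# Collapse first; the levels come out sorted (No, Yes) at this point.
collapsed <- fct_collapse(factor(x),
                          Yes = c("Y", "Yes"),
                          No = c("N", "No"),
                          NULL = "H")

# Then move the levels into the wanted order, regardless of the data.
collapsed <- fct_relevel(collapsed, "Yes", "No")
collapsed
```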