One of the basic data types in R is factors. In my experience factors are basically a pain and I never use them. I always convert to characters. I feel oddly like I'm missing something.
Are there some important examples of functions that use factors as grouping variables where the factor data type becomes necessary? Are there specific circumstances when I should be using factors?
Factors in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed. The factor function is used to create a factor. The only required argument to factor is a vector of values which will be returned as a vector of factor values.
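To make the storage model above concrete, here is a minimal sketch. The vector x and its values are made up for illustration; factor(), levels(), as.integer() and typeof() are base R.

```r
# Hypothetical example data: a character vector with repeated values
x <- c("low", "high", "medium", "low", "high")

# factor() needs only the vector of values
f <- factor(x)

# The distinct values become the levels (sorted by default)
levels(f)

# Each element is stored as an integer index into the levels
as.integer(f)

# The underlying storage type is integer, not character
typeof(f)
```

Printing f shows the labels, but the integer codes are what R actually stores.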
In R, you can convert multiple numeric variables to factors using the lapply function, which is part of the apply family of functions. These functions iterate over the elements of a list or vector, taking the place of an explicit loop. Categorical variables are conventionally stored as factors.
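A minimal sketch of that conversion, assuming a small made-up data frame; the names df and cols and the column names are hypothetical.

```r
# Hypothetical data frame with two numeric columns that are really categories
df <- data.frame(
  cyl  = c(4, 6, 8, 6),
  gear = c(3, 4, 5, 4),
  mpg  = c(21.0, 22.8, 18.7, 19.2)
)

# Names of the columns to convert (an assumption for this example)
cols <- c("cyl", "gear")

# lapply() applies factor() to each selected column and returns a list,
# which is assigned back into the same columns
df[cols] <- lapply(df[cols], factor)

# Check the resulting column classes
sapply(df, class)
```

Columns not listed in cols, such as mpg here, are left untouched.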
How do I rename factor levels in R? The simplest way to rename multiple factor levels is to use the levels() function. For example, to recode the factor levels "A", "B", and "C" (replaced in their existing order) you can use the following code: levels(your_df$Category1) <- c("Factor 1", "Factor 2", "Factor 3").
In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order. Historically, factors were much easier to work with than characters.
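The point about non-alphabetical order can be sketched as follows. The sizes vector is invented for illustration; the levels argument to factor() fixes the order.

```r
# Hypothetical character data
sizes <- c("medium", "small", "large", "small")

# A character vector sorts alphabetically
sort(sizes)

# A factor sorts (and tabulates, and plots) in the order of its levels
f <- factor(sizes, levels = c("small", "medium", "large"))
sort(f)
table(f)
```

The same level order is what plotting and tabulating functions generally follow when laying out categories.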
You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is because in read.table and read.csv, the argument stringsAsFactors was TRUE by default before R 4.0.0 (and most users missed this subtlety). I say they are useful because model fitting packages like lme4 use factors and ordered factors to differentially fit models and determine the type of contrasts to use. Graphing packages also use them for grouping. ggplot and most model fitting functions coerce character vectors to factors, so the result is the same. However, you end up with warnings in your code:
    lm(Petal.Length ~ -1 + Species, data=iris)

    # Call:
    # lm(formula = Petal.Length ~ -1 + Species, data = iris)

    # Coefficients:
    #     Speciessetosa  Speciesversicolor   Speciesvirginica
    #             1.462              4.260              5.552

    iris.alt <- iris
    iris.alt$Species <- as.character(iris.alt$Species)
    lm(Petal.Length ~ -1 + Species, data=iris.alt)

    # Call:
    # lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)

    # Coefficients:
    #     Speciessetosa  Speciesversicolor   Speciesvirginica
    #             1.462              4.260              5.552

    Warning message:
    In model.matrix.default(mt, mf, contrasts) :
      variable Species converted to a factor
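
The earlier remark that factors and ordered factors determine the type of contrasts can be seen directly in base R. This is a rough sketch assuming R's default contrasts option; the dose vector and the variable names are made up.

```r
# Hypothetical categorical data
dose <- c("low", "mid", "high", "mid")

# Same values, once as an unordered factor and once as an ordered factor
f_unordered <- factor(dose, levels = c("low", "mid", "high"))
f_ordered   <- factor(dose, levels = c("low", "mid", "high"), ordered = TRUE)

# With default options, unordered factors get treatment (dummy) contrasts
contrasts(f_unordered)

# ...while ordered factors get polynomial contrasts (linear, quadratic, ...)
contrasts(f_ordered)
```

A character vector carries no such information, so a modeling function can only treat it as an unordered factor.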
One tricky thing is the whole drop=TRUE bit. In vectors this works well to remove levels of factors that aren't in the data. For example:
    s <- iris$Species
    s[s == 'setosa', drop=TRUE]
    #  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # Levels: setosa
    s[s == 'setosa', drop=FALSE]
    #  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # Levels: setosa versicolor virginica
However, with data.frames, the behavior of [.data.frame() is different: see this email or ?"[.data.frame". Using drop=TRUE on data.frames does not work as you'd imagine:
    x <- subset(iris, Species == 'setosa', drop=TRUE)  # subsetting with [ behaves the same way
    x$Species
    #  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # Levels: setosa versicolor virginica
Luckily, you can drop unused factor levels easily with droplevels(), either for an individual factor or for every factor in a data.frame (since R 2.12):
    x <- subset(iris, Species == 'setosa')
    levels(x$Species)
    # [1] "setosa"     "versicolor" "virginica"
    x <- droplevels(x)
    levels(x$Species)
    # [1] "setosa"
This is how to keep levels you've filtered out from showing up in ggplot legends.
Internally, factors are integers with a levels attribute that is a character vector (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name (and you were using character strings), this would be a much less efficient operation. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's the risk that you'll change just one element, and accidentally create a separate new level.
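The difference described above can be sketched like this. The vectors f and ch and the new label "alpha" are made up for illustration.

```r
# With a factor, renaming a level is a single assignment on the levels
# attribute, and every element carrying that level changes with it
f <- factor(c("a", "b", "a", "c"))
levels(f)[levels(f) == "a"] <- "alpha"
as.character(f)

# With a character vector, editing one element leaves the others behind
ch <- c("a", "b", "a", "c")
ch[1] <- "alpha"

# The data now has four distinct "levels" instead of three
unique(ch)
```

The factor still has three levels after the rename, while the character vector has quietly gained a fourth distinct value.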