Factors in R: more than an annoyance?

Q: Why are factors in R useful?

In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order. Historically, factors were much easier to work with than characters.
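The point about non-alphabetical ordering can be sketched in a few lines. The `sizes` vector below is made-up illustration data, not something from the question; only base R functions are used.

```r
# Hypothetical example data: sizes stored as plain strings
sizes <- c("medium", "small", "large", "small", "medium")

# A character vector sorts alphabetically
sort(sizes)

# A factor with explicit levels sorts and tabulates in level order instead
sizes_f <- factor(sizes, levels = c("small", "medium", "large"))
sort(sizes_f)
table(sizes_f)
```

The order given in `levels` is what downstream functions such as `table()` and most plotting code pick up.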

1 Answer

You should use factors. Yes, they can be a pain, but my theory is that 90% of why they're a pain is that in read.table and read.csv the argument stringsAsFactors was TRUE by default (and most users missed this subtlety); since R 4.0.0 the default is FALSE. I say they are useful because model-fitting packages like lme4 use factors and ordered factors to differentially fit models and determine the type of contrasts to use. Graphing packages also use them to group by. ggplot2 and most model-fitting functions coerce character vectors to factors, so the result is the same. However, you can end up with warnings in your code:

    lm(Petal.Length ~ -1 + Species, data=iris)

    # Call:
    # lm(formula = Petal.Length ~ -1 + Species, data = iris)

    # Coefficients:
    #     Speciessetosa  Speciesversicolor   Speciesvirginica
    #             1.462              4.260              5.552

    iris.alt <- iris
    iris.alt$Species <- as.character(iris.alt$Species)
    lm(Petal.Length ~ -1 + Species, data=iris.alt)

    # Call:
    # lm(formula = Petal.Length ~ -1 + Species, data = iris.alt)

    # Coefficients:
    #     Speciessetosa  Speciesversicolor   Speciesvirginica
    #             1.462              4.260              5.552

    Warning message:
    In model.matrix.default(mt, mf, contrasts) :
      variable Species converted to a factor
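As for the stringsAsFactors subtlety mentioned above, one way to avoid surprises is to set the argument explicitly rather than rely on the default. The sketch below uses a small made-up CSV passed through the `text` argument in place of a real file.

```r
# Hypothetical inline CSV standing in for a file on disk
csv_text <- "id,group\n1,a\n2,b\n3,a"

# Ask for plain character columns
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Ask for factor columns
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$group)
class(as_fct$group)
levels(as_fct$group)
```

Stating the argument either way makes the script behave the same regardless of which R version runs it.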

One tricky thing is the whole drop=TRUE bit. In vectors this works well to remove levels of factors that aren't in the data. For example:

    s <- iris$Species
    s[s == 'setosa', drop=TRUE]
    #  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # Levels: setosa
    s[s == 'setosa', drop=FALSE]
    #  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # Levels: setosa versicolor virginica

However, with data.frames, the behavior of [.data.frame() is different: see this email or ?"[.data.frame". Using drop=TRUE on data.frames does not work as you'd imagine:

    x <- subset(iris, Species == 'setosa', drop=TRUE)  # subsetting with [ behaves the same way
    x$Species
    #  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [11] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [21] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [31] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # [41] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
    # Levels: setosa versicolor virginica

Luckily, you can remove unused factor levels easily with droplevels(), either for an individual factor or for every factor in a data.frame (since R 2.12):

    x <- subset(iris, Species == 'setosa')
    levels(x$Species)
    # [1] "setosa"     "versicolor" "virginica"
    x <- droplevels(x)
    levels(x$Species)
    # [1] "setosa"

This is how to keep levels you've filtered out from showing up in ggplot legends.
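The same leftover-level effect can be seen without any plotting package. As a rough illustration (not part of the original answer's code), base R's `table()` reports a count for every level, including the ones that no longer occur in the data:

```r
# Subset keeps all three levels even though only one remains in the data
x <- subset(iris, Species == "setosa")
table(x$Species)

# After droplevels() only the level actually present is tabulated
table(droplevels(x)$Species)
```

Anything that iterates over `levels()`, legends included, will see those empty categories until they are dropped.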

Internally, a factor is an integer vector with a levels attribute that holds a character vector (see attributes(iris$Species) and class(attributes(iris$Species)$levels)), which is clean. If you had to change a level name while using plain character strings, it would be a much less efficient operation. And I change level names a lot, especially for ggplot legends. If you fake factors with character vectors, there's the risk that you'll change just one element and accidentally create a separate new level.
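A minimal sketch of those last points, using small made-up vectors (`f`, `ch`, and `g` are illustrative names only):

```r
f <- factor(c("lo", "hi", "lo", "hi", "hi"))

typeof(f)     # storage type of the underlying vector
unclass(f)    # integer codes plus the levels attribute
levels(f)

# Renaming a level edits the levels attribute once, not every element
levels(f)[levels(f) == "hi"] <- "high"
f

# With a character vector, a one-element edit silently adds a new value
ch <- c("lo", "hi", "lo", "hi", "hi")
ch[2] <- "Hi"   # typo
unique(ch)

# A factor rejects a value that is not a level: it warns and stores NA
g <- factor(c("lo", "hi", "lo"))
suppressWarnings(g[2] <- "Hi")
g
```

The factor version fails loudly on the typo, while the character version quietly ends up with three distinct values.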


answered Oct 10 '22 04:10

Vince

Tags:

language-design

r

internals

r-factor

JD Long
