What are the practical differences between 'factor' and 'string' data types in R?

Tags:

r

From other programming languages I am familiar with the string data type. In addition to this data type, R also has the factor data type. I am new to the R language, so I am trying to wrap my head around the intent behind this new data type.

Question: What are the practical differences between 'factor' and 'string' data types in R?

I get that (on a conceptual/philosophical level) the factor data type is supposed to encode the values of a categorical random variable, but I do not understand (on a practical level) why the string data type would be insufficient for this purpose.

Seemingly having duplicate data types which serve the same practical purpose would be bad design. However, if R were truly poorly designed on such a fundamental level, it would be much less likely to have achieved the level of popularity it has. So either a very improbable event has happened, or I am misunderstanding the practical significance/purpose of the factor data type.

Attempt: The one thing I could think of is the concept of "factor levels", whereby one can assign an ordering to factors (which one can't do for strings), which is helpful when describing "ordinal categorical variables", i.e. categorical variables with an order (e.g. "Low", "Medium", "High").

(Although even this wouldn't seem to make factors strictly necessary. Since the ordering is always linear, i.e. no true partial orders, on countable sets, we could always just accomplish the same with a map from some subset of the integers to the strings in question -- however in practice that would probably be a pain to implement over and over again, and a naive implementation would probably not be as efficient as the implementation of factors and factor levels built into R.)

However, not all categorical variables are ordinal, some are "nominal" (i.e. have no order). And yet "factors" and "factor levels" still seem to be used with these "nominal categorical variables". Why is this? I.e. what is the practical benefit to using factors instead of strings for such variables?

The only other information I could find on this subject is the following quote here:

Furthermore, storing string variables as factor variables is a more efficient use of memory.

What is the reason for this? Is this only true for "ordinal categorical variables", or is it also true for "nominal categorical variables"?

Related but different questions: These questions seem relevant, but don't specifically address the heart of my question -- namely, the difference between factors and strings, and why having such a difference is useful (from a programming perspective, not a statistical one).

Difference between ordered and unordered factor variables in R
Factors ordered vs. levels
Is there an advantage to ordering a categorical variable?
factor() command in R is for categorical variables with hierarchy level only?


asked Apr 15 '17 10:04

Chill2Macht

1 Answer

Practical differences:

If x is a string, it can take any value. If x is a factor, it can only take values from a fixed list of levels. R stores a factor as integer codes that index into that list of levels, which can also make these variables more memory efficient.

Example:

> x <- factor(c("cat1","cat1","cat2"),levels = c("cat1","cat2") )
> x
[1] cat1 cat1 cat2
Levels: cat1 cat2
> x[3] <- "cat3"
Warning message:
In `[<-.factor`(`*tmp*`, 3, value = "cat3") :
  invalid factor level, NA generated
> x
[1] cat1 cat1 <NA>
Levels: cat1 cat2
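
To make the "fixed list of levels" point concrete for nominal (unordered) variables, here is a minimal sketch. The variable names (`storage`, `codes`, `counts`, `counts_chr`) and the extra level "cat3" are only illustrative assumptions, not part of the example above. It shows that a factor is an integer vector underneath, and that levels which never occur are still remembered, which a plain character vector cannot do.

```r
# Illustrative data: "cat3" is declared as a level but never observed.
x <- factor(c("cat1", "cat1", "cat2"), levels = c("cat1", "cat2", "cat3"))

# Underneath, a factor is an integer vector plus a "levels" attribute.
storage <- typeof(x)      # "integer"
codes   <- as.integer(x)  # 1 1 2 (positions in the levels list)
lv      <- levels(x)      # "cat1" "cat2" "cat3"

# Tabulating a factor reports every level, including the empty one.
counts <- table(x)
print(counts)

# The same values as plain strings know nothing about "cat3".
counts_chr <- table(as.character(x))
print(counts_chr)
```

This is one practical reason to use factors even when no ordering is involved: summaries and models can see the full set of allowed categories, not only the ones that happen to appear in the data.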

As you said, you can have ordinal factors, meaning that you can add extra information about your variable, for instance that level1 < level2 < level3. Character vectors don't have that. However, the order doesn't necessarily have to be linear; I'm not sure where you found that.
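
As a rough sketch of the ordinal case, using the "Low", "Medium", "High" values from the question (the variable names here are hypothetical): an ordered factor compares and sorts by the position of each value in its levels, while plain strings fall back to alphabetical comparison.

```r
# Illustrative values taken from the question's example categories.
sizes_chr <- c("Low", "High", "Medium")

# ordered = TRUE makes the level sequence meaningful: Low < Medium < High.
sizes <- factor(sizes_chr,
                levels = c("Low", "Medium", "High"),
                ordered = TRUE)

low_lt_high <- sizes[1] < sizes[2]  # compares Low with High by level position
sorted      <- sort(sizes)          # sorted by level order, not alphabetically
largest     <- max(sizes)           # max() is defined for ordered factors

print(sorted)
print(largest)

# Plain strings sort alphabetically (exact collation depends on the locale),
# so "High" would typically come before "Low".
print(sort(sizes_chr))
```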

answered Nov 14 '22 23:11

Vasilis Vasileiou
