I have a 2.5 GB dataset, which is large relative to my 4 GB of memory. I wonder whether converting character variables to factors will save space and processing time.
I would imagine that internally, factors are stored as numbers with a lookup table for the levels, but I am not sure how it actually works.
Most statistical operations within R that can act on a character variable will essentially convert to a factor first. So, it's more efficient to convert characters to factors before passing them into these kinds of functions. This also gives us more control over what we're going to get.
In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order. Historically, factors were much easier to work with than characters.
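As a small illustration of the non-alphabetical ordering mentioned above, here is a sketch using made-up month data. The vector `months_chr` is hypothetical; `month.abb` is R's built-in vector of month abbreviations.

```r
# Hypothetical example data: month abbreviations as plain strings.
months_chr <- c("Dec", "Apr", "Jan", "Mar")

# Sorting a character vector is alphabetical.
sort(months_chr)                 # "Apr" "Dec" "Jan" "Mar"

# Giving the factor explicit levels fixes the order to calendar order.
months_fct <- factor(months_chr, levels = month.abb)
sorted_fct <- sort(months_fct)
as.character(sorted_fct)         # "Jan" "Mar" "Apr" "Dec"
```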
What is a factor in R? A factor is a variable used to categorize data that takes a limited number of distinct values. It stores the data as a vector of integer codes, and the distinct values themselves are kept as a character vector of levels. A factor is also known as a categorical variable.
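To make the "integers plus a lookup table" picture concrete, here is a minimal sketch you can run in any R session. The colour data is made up for illustration.

```r
# Hypothetical example data.
x <- c("red", "green", "red", "blue", "green", "red")
f <- factor(x)

typeof(f)       # "integer": the underlying storage type
levels(f)       # the lookup table, sorted by default: "blue" "green" "red"
as.integer(f)   # the codes: 3 2 3 1 2 3

# Indexing the levels by the codes reconstructs the original strings.
rebuilt <- levels(f)[as.integer(f)]
identical(rebuilt, x)   # TRUE
```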
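Converting to a factor won't save much space, because R already stores each distinct string only once, in a global hash table; a character vector just holds pointers into it. At most you save the difference between a pointer and an integer code per element. See section 1.10, "The CHARSXP cache", of R Internals.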
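A rough way to check this on your own machine is `object.size()`. The sketch below uses simulated data (one million values drawn from five made-up strings), so it only approximates a real dataset. The reported sizes are estimates and depend on the R version and on whether the build is 32- or 64-bit, so no output is shown.

```r
set.seed(1)

# Hypothetical data: many elements, few distinct strings.
x <- sample(c("alpha", "beta", "gamma", "delta", "epsilon"),
            size = 1e6, replace = TRUE)
f <- factor(x)

# Estimated memory use of each representation.
size_chr <- object.size(x)
size_fct <- object.size(f)
print(size_chr, units = "Mb")
print(size_fct, units = "Mb")
```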
Converting to a factor may improve processing time if your code would need to convert to a factor anyway (running a regression, classification, etc.), but it won't help if you're doing string manipulation, because the factor would have to be converted back to character first. So it really depends on what you're doing.
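Both cases can be seen in a short sketch, again with made-up data. `toupper()` stands in for string manipulation in general, and `model.matrix()` for modelling code that needs a factor; any timing difference depends on your data, so none is claimed here.

```r
# Hypothetical example data.
x <- c("low", "high", "medium", "high", "low")
f <- factor(x)

# String functions coerce a factor back to character before working on it,
# so the result is a plain character vector, not a factor.
upper <- toupper(f)
is.character(upper)   # TRUE

# model.matrix() turns a character column into a factor itself, so both
# inputs produce the same dummy-variable columns (only the names differ).
mm_chr <- model.matrix(~ x, data = data.frame(x = x, stringsAsFactors = FALSE))
mm_fct <- model.matrix(~ f, data = data.frame(f = f))
all(mm_chr == mm_fct)   # TRUE
```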