
Does converting character columns to factors save memory?

Tags: dataframe, r

I have a 2.5 GB dataset, which is quite large for my 4 GB of memory. I wonder whether converting character variables to factors would save space and processing time.

I imagine that, internally, factors are stored as numeric codes with a lookup table for the levels, but I am not sure how it actually works.

AdamNYC asked Nov 26 '12 17:11




1 Answer

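Converting to factor won't save much space, because character strings are stored in a global hash table: each distinct string is kept only once, and a character vector just holds pointers to the cached strings. See section 1.10, The CHARSXP cache, of R Internals.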
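If you want to check this on your own machine, a minimal sketch along these lines compares the two representations. The data here is made up (one million values drawn from 26 distinct strings), and the reported sizes depend on your platform and R version, so treat it as a measurement recipe, not a fixed result.

```r
# Hypothetical data: many values, few distinct strings.
set.seed(1)
x_chr <- sample(letters, 1e6, replace = TRUE)
x_fct <- factor(x_chr)

# A factor is an integer vector of codes plus a "levels" attribute.
typeof(x_fct)        # "integer"
nlevels(x_fct)       # number of distinct strings kept as levels

# Compare the reported sizes of the two representations.
print(object.size(x_chr), units = "MB")
print(object.size(x_fct), units = "MB")
```

Running the same comparison on a sample of your real columns is a more reliable guide than any general rule.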

Converting to factor may improve processing time if your code would otherwise have to do the conversion itself (running a regression, classification, etc.), but it won't help if you're doing string manipulation, because the factor would have to be converted back to character first. So it really depends on what you're doing.
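As a small illustration of the two cases, using a made-up factor `f`: string functions such as `toupper()` coerce a factor to character before doing their work, while `table()` can use an existing factor directly.

```r
# Hypothetical example data.
f <- factor(c("apple", "banana", "apple", "cherry"))

# String manipulation: the factor is coerced to character first,
# so the result is a plain character vector, not a factor.
up <- toupper(f)
class(up)            # "character"

# Categorical-style work: table() counts per level and can use
# the factor as it is.
counts <- table(f)
counts[["apple"]]    # 2
```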

Joshua Ulrich answered Oct 15 '22 20:10