Are factors stored more efficiently in data.table than characters?

I thought I had read somewhere (I can't remember where) that factors are not actually more efficient than character vectors in data.table. Is this true? I was debating whether to continue using factors to store various vectors in data.table. An approximate test with object.size seems to indicate otherwise.

    library(data.table)

    chars <- data.table(a = sample(letters, 1e5, TRUE))           # chars (not really)
    string <- data.table(a = sample(state.name, 1e5, TRUE))       # strings
    fact <- data.table(a = factor(sample(letters, 1e5, TRUE)))    # factor
    int <- data.table(a = sample(1:26, 1e5, TRUE))                # int

    # Print the approximate size in MB of each object passed by name
    mbs <- function(...) {
        ns <- sapply(match.call(expand.dots=TRUE)[-1L], deparse)
        vals <- mget(ns, .GlobalEnv)
        cat('Sizes:\n',
            paste('\t', ns, ':', round(sapply(vals, object.size)/1024/1024, 3), 'MB\n'))
    }

    ## Get approximate sizes?
    mbs(chars, string, fact, int)
    # Sizes:
    #    chars : 0.765 MB
    #    string : 0.766 MB
    #    fact : 0.384 MB
    #    int : 0.382 MB
Rorschach asked Jan 18 '16 19:01



1 Answer

You may be remembering data.table FAQ 2.17, which contains:

stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.

(That part was added to the FAQ in v1.8.2 in July 2012.)

Using character rather than factor helps a lot in tasks like stacking (rbindlist). A c() of two character vectors is just the concatenation, whereas a c() of two factor columns needs to traverse and union the two sets of factor levels, which is harder to code and takes longer to execute.
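To make the stacking point concrete, here is a minimal base R sketch of the bookkeeping involved. It is an illustration only, not how rbindlist is implemented; the names f1, f2, all_levels and codes are made up for the example.

```r
# Two small factors whose levels differ.
f1 <- factor(c("a", "b"))   # levels: a, b
f2 <- factor(c("b", "c"))   # levels: b, c

# Character vectors: stacking is plain concatenation.
stacked_chr <- c(as.character(f1), as.character(f2))

# Factors: the raw integer codes cannot simply be concatenated,
# because code 1 means "a" in f1 but "b" in f2.
naive_codes <- c(as.integer(f1), as.integer(f2))

# Instead, build the union of the levels and re-map each input onto it.
all_levels <- union(levels(f1), levels(f2))
codes <- c(match(levels(f1), all_levels)[as.integer(f1)],
           match(levels(f2), all_levels)[as.integer(f2)])
stacked_fct <- factor(all_levels[codes], levels = all_levels)

print(stacked_chr)               # the concatenated strings
print(naive_codes)               # misleading: codes refer to different level sets
print(as.integer(stacked_fct))   # codes re-mapped onto the shared levels
```

The extra union and re-mapping step is the part that character columns avoid entirely.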

What you've noticed is a difference in RAM consumption on 64-bit machines. A factor is stored as an integer vector of indices into its levels. Type integer is 32-bit, even on 64-bit platforms, but the elements of a character vector are pointers, which are 64-bit on 64-bit machines. So a character column will use twice as much RAM as a factor column on a 64-bit machine; there is no difference on 32-bit. However, this cost will usually be outweighed by the simpler and faster instructions possible on a character vector. [Aside: since factors are integer, they can't contain more than 2 billion unique strings. Character columns don't have that limitation.]
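The 4-byte versus 8-byte difference can be checked on plain vectors, since a data.table column is an ordinary R vector. This is a rough sketch using object.size, which only gives approximate figures; the variable names are illustrative, and the roughly two-to-one ratio is expected only where .Machine$sizeof.pointer is 8.

```r
set.seed(1)
n <- 1e5

# The same low-cardinality data stored three ways.
x_chr <- sample(letters, n, replace = TRUE)  # one pointer per element
x_fct <- factor(x_chr)                       # one 32-bit integer code per element, plus levels
x_int <- as.integer(x_fct)                   # the bare integer codes

sizes <- c(character = as.numeric(object.size(x_chr)),
           factor    = as.numeric(object.size(x_fct)),
           integer   = as.numeric(object.size(x_int)))

# Approximate sizes in MB, and the pointer width of this R build in bytes.
print(round(sizes / 1024^2, 3))
print(.Machine$sizeof.pointer)

# Character-to-factor size ratio; close to 2 on a 64-bit build.
ratio <- sizes[["character"]] / sizes[["factor"]]
print(ratio)
```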

It depends on what you're doing, but operations have been optimized for character in data.table, so that's what we advise. Basically, it saves a hop (to the levels), and we can compare two character columns in different tables just by comparing the pointer values, without hopping at all, even to the global cache.

It depends on the cardinality of the column, too. Say the column has 1 million rows and contains 1 million unique strings. Storing it as a factor needs a character vector of length 1 million for the levels, plus an integer vector of length 1 million indexing into those levels. That's (4+8)*1e6 bytes. A character vector, on the other hand, doesn't need the levels, so it's just 8*1e6 bytes. In both cases the global cache stores the 1 million unique strings in the same way, so that cost is incurred regardless. In this case, the character column will use less RAM than it would as a factor. Be careful to check that the memory tool used to measure RAM usage accounts for this appropriately.
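The high-cardinality case can be sketched the same way, again on plain vectors and with made-up names. It uses 1e5 rows rather than 1 million to keep it quick. Note that object.size also counts the string storage itself, so the absolute numbers are larger than the 8 and 12 bytes per row above; the point of interest is the difference between the two.

```r
n <- 1e5

# Every value is unique, so the factor's levels are as long as the column.
x_chr <- sprintf("id%06d", seq_len(n))
x_fct <- factor(x_chr)

size_chr <- as.numeric(object.size(x_chr))
size_fct <- as.numeric(object.size(x_fct))

# The factor carries the integer codes on top of a full-length levels vector,
# so it should come out larger than the character vector here.
print(c(character = size_chr, factor = size_fct))

# Extra bytes per row attributable to the factor representation
# (about 4, the size of one integer code).
extra_per_row <- (size_fct - size_chr) / n
print(extra_per_row)
```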

Matt Dowle answered Sep 18 '22 09:09