
Why is "character often preferred to factor" for keys in data.table?

Tags: r, data.table

From data.table manual:

In fact we like it so much that data.table contains a counting sort algorithm for character vectors using R’s internal global string cache. This is particularly fast for character vectors containing many duplicates, such as grouped data in a key column. This means that character is often preferred to factor. Factors are still fully supported, in particular ordered factors (where the levels are not in alphabetic order).

Isn't a factor just an integer vector, which should be easier to counting-sort than a character vector?

colinfang asked Aug 18 '13 23:08



1 Answer

Isn't a factor just an integer vector, which should be easier to counting-sort than a character vector?

Yes, if you're given a factor already. But the time to create that factor can be significant and that's what setkey (and ad hoc by) aim to beat. Try timing factor() on a randomly ordered character vector, say 1e6 long with 1e4 levels. Then compare to setkey or ad hoc by on the original randomly ordered character vector.
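The timing experiment suggested above could be sketched roughly as follows. The string pool, the column names x and v, and the seed are made up for illustration, and the numbers will vary by machine and data.table version, so no particular result is claimed here.

```r
library(data.table)

set.seed(1)
n <- 1e6   # vector length suggested in the answer
k <- 1e4   # number of distinct values ("levels")

# Hypothetical pool of k distinct strings, drawn in random order
pool <- sprintf("id%05d", seq_len(k))
x <- sample(pool, n, replace = TRUE)

# Cost of building a factor from the character vector
t_factor <- system.time(f <- factor(x))

# Cost of keying a data.table directly on the character column
dt <- data.table(x = x, v = 1L)
t_setkey <- system.time(setkey(dt, x))

# Cost of ad hoc grouping on an unkeyed character column
dt2 <- data.table(x = x, v = 1L)
t_by <- system.time(res <- dt2[, .(total = sum(v)), by = x])

# Elapsed seconds for the three approaches
timings <- c(factor = t_factor[["elapsed"]],
             setkey = t_setkey[["elapsed"]],
             by     = t_by[["elapsed"]])
print(timings)
```

The comparison that matters is factor() against each of the other two, since a factor has to be built before its integer codes can be sorted or grouped.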

agstudy's comment is correct too: character vectors (being pointers to R's cached strings) are quite similar to factors anyway. On 32-bit systems a character vector is the same size as a factor's integer vector, but the factor also has the levels attribute to store (and sometimes copy). On 64-bit systems the pointers are twice as big as the integers. On the other hand, R's string cache can be reached directly from the character vector's pointers, whereas a factor needs an extra hop via levels. (The levels attribute is itself a character vector of R string cache pointers.)
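A small sketch of the layout described above, for anyone who wants to inspect it on their own machine. The sample data is hypothetical, and object.size() gives only a rough indication of memory use, so treat the printed sizes as approximate.

```r
set.seed(1)

# Hypothetical data: 1e5 strings drawn from 1e3 distinct values
x <- sample(sprintf("id%04d", 1:1000), 1e5, replace = TRUE)
f <- factor(x)

# Bytes per pointer on this build of R (typically 8 on 64-bit, 4 on 32-bit)
print(.Machine$sizeof.pointer)

# A factor is an integer vector of codes plus a character 'levels' attribute
print(typeof(f))          # storage type of the codes
print(class(levels(f)))   # the levels are themselves a character vector

# Approximate sizes of the character vector and the factor
print(object.size(x))
print(object.size(f))

# The "extra hop": recovering the strings from a factor goes through levels
via_levels <- levels(f)[as.integer(f)]
print(identical(via_levels, x))
```

The last line makes the indirection explicit: each integer code has to be looked up in levels before the cached string is reached, whereas the elements of x point at the cached strings directly.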

Matt Dowle answered Oct 28 '22 20:10
