Why is character "often preferred to factor" for a key in data.table?

Question

From data.table manual:

In fact we like it so much that data.table contains a counting sort algorithm for character vectors using R’s internal global string cache. This is particularly fast for character vectors containing many duplicates, such as grouped data in a key column. This means that character is often preferred to factor. Factors are still fully supported, in particular ordered factors (where the levels are not in alphabetic order).

Isn't a factor just an integer vector, which should be easier to counting-sort than a character vector?

Matt Dowle · Accepted Answer

Isn't a factor just an integer vector, which should be easier to counting-sort than a character vector?

Yes, if you're given a factor already. But the time to create that factor can be significant, and that's what setkey (and ad hoc by) aim to beat. Try timing factor() on a randomly ordered character vector, say 1e6 long with 1e4 levels. Then compare that to setkey or ad hoc by on the same randomly ordered character vector.
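A minimal sketch of that timing comparison, assuming data.table is installed. The label format ("id00001" and so on), the dummy column v, and the variable names are made up for illustration; actual timings depend on the machine and on the R and data.table versions, so none are shown here.

```r
library(data.table)

set.seed(1)
n        <- 1e6   # length of the character vector
n_levels <- 1e4   # number of distinct strings

# Hypothetical labels, sampled into a randomly ordered character vector
lev <- sprintf("id%05d", seq_len(n_levels))
x   <- sample(lev, n, replace = TRUE)

# Cost of building the factor from the character vector
t_factor <- system.time(f <- factor(x))

# Cost of keying directly on the character column
DT <- data.table(x = x, v = 1L)
t_setkey <- system.time(setkey(DT, x))

# Cost of an ad hoc by on a fresh, unsorted copy of the same column
DT2 <- data.table(x = x, v = 1L)
t_by <- system.time(res <- DT2[, .(total = sum(v)), by = x])

# Elapsed seconds for each approach
print(c(factor = t_factor[["elapsed"]],
        setkey = t_setkey[["elapsed"]],
        by     = t_by[["elapsed"]]))
```

Each data.table call starts from the raw character column, so the comparison includes the work that factor() would otherwise have to do up front.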

agstudy's comment is correct too: character vectors (being pointers to R's cached strings) are quite similar to factors anyway. On 32-bit systems a character vector is the same size as a factor's integer vector, but the factor also has the levels attribute to store (and sometimes copy). On 64-bit systems the pointers are twice as big. On the other hand, R's string cache can be looked up directly from the character vector's pointers, whereas a factor needs an extra hop via levels. (The levels attribute is itself a character vector of R string cache pointers.)
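To make the "extra hop via levels" concrete, here is a small base R sketch. The example vector is made up, and this only shows the R-level view of a factor, not data.table's internal C code.

```r
x <- c("b", "a", "b", "c", "a")
f <- factor(x)

# A factor is stored as integer codes plus a "levels" attribute
typeof(f)        # "integer"
as.integer(f)    # the integer codes, one per element
levels(f)        # character vector of the distinct strings

# Getting the strings back from a factor goes through levels (the extra hop)
x_from_factor <- levels(f)[as.integer(f)]
identical(x_from_factor, x)   # TRUE

# Per-element storage: a character vector holds one pointer per element,
# a factor holds one 4-byte integer per element (plus the levels attribute)
bytes_per_pointer <- .Machine$sizeof.pointer   # 4 on 32-bit builds, 8 on 64-bit builds
bytes_per_integer <- 4L
print(c(character = bytes_per_pointer, factor_codes = bytes_per_integer))
```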

Tags:

r

data.table

colinfang
