
Does the order of keys in data.table matter?

Tags: r, data.table

I have a data.table with two key columns: Year (10 levels) and MemberID (200,000 levels). When I set the key, does setkey(MemberID, Year) perform differently from setkey(Year, MemberID)? If so, which order is better?

AdamNYC asked Dec 04 '12 00:12


1 Answer

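The speed of setting the key depends on the types of the key columns. In the timings below, a numeric column is slower than an integer one, and character columns (at least with short strings) appear to be fast. The order of the key columns makes little difference to the time taken by setkey itself.

For example: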

library(data.table)

set.seed(1)
# Naming: D + type of year + type of id (I = integer, N = numeric, C = character, F = factor).
# Each table gets a copy (suffix 2) so that both key orders are timed on unsorted data.
DIC <- data.table(year = sample(seq_len(10), 5e6, TRUE), id = sample(as.character(seq_len(2e5)), 5e6, TRUE), z = rnorm(5e6))
DIC2 <- copy(DIC)
DIF <- data.table(year = sample(seq_len(10), 5e6, TRUE), id = sample(as.factor(seq_len(2e5)), 5e6, TRUE), z = rnorm(5e6))
DIF2 <- copy(DIF)
DNC <- data.table(year = sample(as.numeric(seq_len(10)), 5e6, TRUE), id = sample(as.character(seq_len(2e5)), 5e6, TRUE), z = rnorm(5e6))
DNC2 <- copy(DNC)
DCC <- data.table(year = sample(as.character(seq_len(10)), 5e6, TRUE), id = sample(as.character(seq_len(2e5)), 5e6, TRUE), z = rnorm(5e6))
DCC2 <- copy(DCC)
DII <- data.table(year = sample(seq_len(10), 5e6, TRUE), id = sample(seq_len(2e5), 5e6, TRUE), z = rnorm(5e6))
DII2 <- copy(DII)

Some timings (in seconds):

# key of integer, character columns
system.time(setkey(DIC, year, id))
   user  system elapsed 
   3.21    0.11    3.31 
system.time(setkey(DIC2, id, year))
   user  system elapsed 
   3.43    0.03    3.45 
# key of integer, factor columns
system.time(setkey(DIF, year, id))
   user  system elapsed 
   6.31    0.05    6.37 
system.time(setkey(DIF2, id, year))
   user  system elapsed 
   6.44    0.06    6.54 
# key of numeric, character columns
system.time(setkey(DNC, year, id))
   user  system elapsed 
   9.91    0.07   10.29 
system.time(setkey(DNC2, id, year))
   user  system elapsed 
  10.11    0.07   10.34 
# key of two character columns
system.time(setkey(DCC, year, id))
   user  system elapsed 
   3.34    0.05    3.40 
system.time(setkey(DCC2, id, year))
   user  system elapsed 
   3.40    0.02    3.42 
# key of two integer columns
system.time(setkey(DII, year, id))
   user  system elapsed 
   6.25    0.02    6.53 
system.time(setkey(DII2, id, year))
   user  system elapsed 
   6.44    0.05    6.64 

As to which order is better: this will probably depend on which column you are more likely to subset by on its own.

For example, you may need to get all the data for year 1.

If you have set the key as year, id then you can use

D[J(1)]

but if the key was set as id, year then you would need

D[J(unique(id),1), nomatch = 0]

which is more typing and will take longer, as it has to compute unique(id).
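The two cases above can be sketched on a small table. This is only an illustration: the toy data and the names D_year_first and D_id_first are made up here and are not part of the benchmark above.

```r
library(data.table)

# Toy data: 3 years x 4 ids, deliberately unsorted within each year
D <- data.table(year = rep(1:3, each = 4),
                id   = rep(c(2L, 1L, 4L, 3L), 3),
                z    = as.numeric(1:12))

# Same data, keyed in the two possible orders
D_year_first <- copy(D)
setkey(D_year_first, year, id)
D_id_first <- copy(D)
setkey(D_id_first, id, year)

# year is the leading key column: join on it directly
a <- D_year_first[J(1L)]

# year is the second key column: every id has to be supplied as well
b <- D_id_first[J(unique(id), 1L), nomatch = 0]

# Both approaches select the same rows for year 1
print(a)
print(b)
```

Both calls pick out the four year-1 rows; the second just needs the extra unique(id) step to build the join table.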

There is a feature request, FR#1007, that looks at allowing a secondary key, but at the time of writing this is not implemented. Currently a table has a single key, which can span more than one column.
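For readers on a more recent data.table release: to my knowledge, later versions added secondary indices via setindex() and the on= argument, which let you subset by a non-leading column without changing the key. A minimal sketch, assuming a version where both are available (1.9.6 or later, as far as I am aware); the toy table D is made up for illustration:

```r
library(data.table)

# Toy data keyed with id first, so year is NOT the leading key column
D <- data.table(year = rep(1:3, each = 4),
                id   = rep(c(2L, 1L, 4L, 3L), 3),
                z    = as.numeric(1:12))
setkey(D, id, year)

# Assumption: setindex() and on= exist in the installed data.table version.
# The index is stored alongside the table; the key and row order are unchanged.
setindex(D, year)

# Subset by year alone, without supplying unique(id)
res <- D[.(1L), on = "year"]
print(res)
print(key(D))  # still id, year
```

The key stays as id, year, so lookups by id remain fast while year can still be queried on its own.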

mnel answered Oct 09 '22 08:10