
What is the purpose of setting a key in data.table?

Tags: r, data.table


Minor update: Please refer to the new HTML vignettes as well. This issue highlights the other vignettes that we plan to add.


I've updated this answer again (Feb 2016) in light of the new on= feature that allows ad-hoc joins as well. See history for earlier (outdated) answers.

What exactly does setkey(DT, a, b) do?

It does two things:

  1. reorders the rows of the data.table DT by the column(s) provided (a, b) by reference, always in increasing order.
  2. marks those columns as key columns by setting an attribute called sorted on DT.

The reordering is both fast (due to data.table's internal radix sorting) and memory efficient (only one extra column of type double is allocated).
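
A minimal sketch of both effects (the values here are made up; a and b are the columns from the example above):

require(data.table)
DT <- data.table(a = c(2L, 1L, 2L, 1L), b = c(4L, 3L, 2L, 1L))
setkey(DT, a, b)        # reorders DT by (a, b) by reference, in increasing order
key(DT)                 # "a" "b" -- the key columns
attr(DT, "sorted")      # the same information, stored as the sorted attribute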

When is setkey() required?

For grouping operations, setkey() was never an absolute requirement. That is, we can perform a "cold" by or an ad hoc by:

## "cold" by
require(data.table)
DT <- data.table(x=rep(1:5, each=2), y=1:10)
DT[, mean(y), by=x] # no key is set, order of groups preserved in result

However, prior to v1.9.6, joins of the form x[i] required a key to be set on x. With the new on= argument in v1.9.6+, this is no longer true, and setting keys is therefore not an absolute requirement here either.

## joins using < v1.9.6 
setkey(X, a) # absolutely required
setkey(Y, a) # not absolutely required as long as 'a' is the first column
X[Y]

## joins using v1.9.6+
X[Y, on="a"]
# or if the column names are x_a and y_a respectively
X[Y, on=c("x_a" = "y_a")]

Note that the on= argument can be specified explicitly even for keyed joins.
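
For example (reusing X and Y from the snippets above):

setkey(X, a)
X[Y, on = "a"]   # key is set, but on= still documents the join column explicitly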

The only operation that absolutely requires a key to be set is the foverlaps() function. But we are working on some more features which, when done, would remove this requirement.
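
For illustration, here is a minimal foverlaps() sketch (the interval values are made up); note that the key on y is mandatory:

require(data.table)
x <- data.table(start = c(5L, 31L), end = c(10L, 50L))
y <- data.table(start = c(1L, 15L), end = c(7L, 35L), id = c("a", "b"))
setkey(y, start, end)           # foverlaps() requires a key on y
foverlaps(x, y, type = "any")   # rows of x that overlap intervals in y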

  • So what's the reason for implementing the on= argument?

    There are quite a few reasons.

    1. It allows us to clearly distinguish the operation as one involving two data.tables. Just writing X[Y] does not make this as clear, although it could be, with appropriately named variables.

    2. It also lets us see immediately, from that line of code alone, the columns on which the join/subset is being performed (without having to trace back to the corresponding setkey() line).

    3. In operations where columns are added or updated by reference, on= operations are much more performant, as they don't require the entire data.table to be reordered just to add/update column(s). For example,

      ## compare 
      setkey(X, a, b) # why physically reorder X to just add/update a column?
      X[Y, col := i.val]
      
      ## to
      X[Y, col := i.val, on=c("a", "b")]
      

      In the second case, we did not have to reorder. It's not computing the order that's time consuming, but physically reordering the data.table in RAM; by avoiding that, we retain the original order and gain performance as well.

    4. Even otherwise, unless you're performing joins repetitively, there should be no noticeable performance difference between keyed and ad hoc joins.

This leads to the question: what advantage does keying a data.table still have?

  • Is there an advantage to keying a data.table?

    Keying a data.table physically reorders it in RAM based on those column(s). Computing the order is not usually the time-consuming part; the reordering itself is. However, once the data is sorted in RAM, rows belonging to the same group are all contiguous, which is very cache efficient. It's the sortedness that speeds up operations on keyed data.tables.

    It is therefore essential to figure out whether the time spent reordering the entire data.table is repaid by the cache-efficient joins/aggregations that follow. Usually, unless repetitive grouping/join operations are performed on the same keyed data.table, there should not be a noticeable difference (see the rough timing sketch below).
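
    A rough way to check this on your own data (timings are machine dependent; this is a sketch, not a benchmark):

    require(data.table)
    DT <- data.table(g = sample(1e5L, 1e7L, TRUE), v = runif(1e7))
    system.time(DT[, sum(v), by = g])   # ad hoc by: order computed each time
    system.time(setkey(DT, g))          # pay the physical reordering cost once
    system.time(DT[, sum(v), by = g])   # repeated grouping on keyed data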

In most cases, therefore, there shouldn't be a need to set keys anymore. We recommend using on= wherever possible, unless setting the key gives a dramatic improvement in performance that you'd like to exploit.

Question: What do you think would be the performance like in comparison to a keyed join, if you use setorder() to reorder the data.table and use on=? If you've followed thus far, you should be able to figure it out :-).


A key is basically an index into a dataset, which allows for very fast and efficient sort, filter, and join operations. These are probably the best reasons to use data tables instead of data frames (the syntax for using data tables is also much more user friendly, but that has nothing to do with keys).

If you don't understand indexes, consider this: a phone book is "indexed" by name. So if I want to look up someone's phone number, it's pretty straightforward. But suppose I want to search by phone number (e.g., look up who has a particular phone number)? Unless I can "re-index" the phone book by phone number, it will take a very long time.

Consider the following example: suppose I have a table, ZIP, of all the zip codes in the US (>33,000) along with associated information (city, state, population, median income, etc.). If I want to look up the information for a specific zip code, the search (filter) is about 1000 times faster if I setkey(ZIP, zipcode) first.
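
A sketch of the comparison (the ZIP table here is fabricated and the speedup will vary with size and hardware):

require(data.table)
ZIP <- data.table(zipcode = sprintf("%05d", seq_len(33000L)),
                  pop     = sample(1e6L, 33000L, replace = TRUE))
ZIP[zipcode == "01234"]   # vector scan: every row is checked
setkey(ZIP, zipcode)
ZIP[.("01234")]           # binary search on the key: much faster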

Another benefit has to do with joins. Suppose I have a list of people and their zip codes in a data table (call it "PPL"), and I want to append information from the ZIP table (e.g. city, state, and so on). The following code will do it:

setkey(ZIP, zipcode)                  # key both tables on the common field
setkey(PPL, zipcode)
full.info <- PPL[ZIP, nomatch = 0L]   # inner join: keep only matched rows

This is a "join" in the sense that I'm joining the information from 2 tables based on a common field (zipcode). Joins like this on very large tables are extremely slow with data frames, and extremely fast with data tables. In a real-life example I had to do more than 20,000 joins like this on a full table of zip codes. With data tables the script took about 20 min. to run. I didn't even try it with data frames because it would have taken more than 2 weeks.
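
With v1.9.6+ (see the first answer above), the same inner join can be written without setting keys, using the on= argument:

full.info <- PPL[ZIP, on = "zipcode", nomatch = 0L]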

IMHO you should not just read but study the FAQ and Intro material. It's easier to grasp if you have an actual problem to apply this to.

[Response to @Frank's comment]

Re: sorting vs. indexing - Based on the answer to this question, it appears that setkey(...) does in fact rearrange the rows in the table (i.e., a physical sort), and does not create an index in the database sense. This has some practical implications: for one thing, if you set the key on a table with setkey(...) and then change any of the values in a key column, data.table merely declares the table to be no longer sorted (by turning off the sorted attribute); it does not dynamically re-index to maintain the proper sort order (as would happen in a database). Also, "removing the key" using setkey(DT, NULL) does not restore the table to its original, unsorted order.
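
A quick sketch of both behaviors (the table contents are made up):

require(data.table)
DT <- data.table(x = c(3L, 1L, 2L), y = c("c", "a", "b"))
setkey(DT, x)       # physically reorders DT by x
key(DT)             # "x"
DT[1L, x := 99L]    # editing a key column drops the key...
key(DT)             # ...now NULL: no dynamic re-indexing happens
setkey(DT, NULL)    # removes any key; does NOT restore the original row order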

Re: filter vs. join - the practical difference is that filtering extracts a subset from a single dataset, whereas join combines data from two datasets based on a common field. There are many different kinds of join (inner, outer, left). The example above is an inner join (only records with keys common to both tables are returned), and this does have many similarities to filtering.
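
In data.table syntax the common flavors look like this (a sketch; the column name id is illustrative):

X[Y, on = "id", nomatch = 0L]        # inner join: only rows matched in both
X[Y, on = "id"]                      # all rows of Y kept, NA where X has no match
merge(X, Y, by = "id", all = TRUE)   # full outer join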