
Why is allow.cartesian required at times when joining data.tables with duplicate keys?

Tags: r, data.table

I am trying to understand the logic of J() lookup when there're duplicate keys in a data.table in R.

Here's a little experiment I have tried:

library(data.table)
options(stringsAsFactors = FALSE)

x <- data.table(keyVar = c("a", "b", "c", "c"),
            value  = c(  1,   2,   3,   4))
setkey(x, keyVar)

y1 <- data.frame(name = c("d", "c", "a"))
x[J(y1$name), ]
## OK

y2 <- data.frame(name = c("d", "c", "a", "b"))
x[J(y2$name), ]
## Error: see below

x2 <- data.table(keyVar = c("a", "b", "c"),
                 value  = c(  1,   2,   3))
setkey(x2, keyVar)
x2[J(y2$name), ]
## OK

The error message I am getting is:

Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  :
Join results in 5 rows; more than 4 = max(nrow(x),nrow(i)). Check for duplicate key
values in i, each of which join to the same group in x over and over again. If that's
ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group
to avoid the large allocation. If you are sure you wish to proceed, rerun with 
allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, 
Stack Overflow and datatable-help for advice.

I don't really understand this. I know I should avoid duplicate keys in a lookup table; I just want to gain some insight so that I don't make this mistake in the future.

Thanks a ton for help. This is a great tool.

yuez asked Apr 15 '14 14:04


1 Answer

You don't have to avoid duplicate keys. As long as the result does not get bigger than max(nrow(x), nrow(i)), you won't get this error, even if you have duplicates. It is basically a precautionary measure.

When you have duplicate keys, the resulting join can sometimes get much bigger. Since data.table knows the total number of rows that will result from the join before it allocates the result, it raises this error and asks you to pass allow.cartesian=TRUE if you're really sure.
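To see how this rule plays out on the data from the question, here is a minimal sketch that counts the rows each lookup would return and compares the total with max(nrow(x), nrow(i)), the limit quoted in the error message. The helper join_rows is hypothetical (it is not part of data.table), and other data.table versions may apply a different limit, so the sketch computes the counts itself rather than relying on the error being raised.

```r
library(data.table)

x <- data.table(keyVar = c("a", "b", "c", "c"),
                value  = c(1, 2, 3, 4))
setkey(x, keyVar)

# Hypothetical helper, not a data.table function: number of rows that
# x[J(keys)] returns. Each lookup value contributes one row per matching
# key in x, or a single NA row when it has no match (the default nomatch).
join_rows <- function(keys) {
  matches <- vapply(keys, function(k) sum(x$keyVar == k), integer(1))
  sum(pmax(matches, 1L))
}

keys1 <- c("d", "c", "a")       # y1$name in the question
keys2 <- c("d", "c", "a", "b")  # y2$name in the question

n1 <- join_rows(keys1)  # d: 1 (NA row), c: 2, a: 1
n2 <- join_rows(keys2)  # d: 1 (NA row), c: 2, a: 1, b: 1

# Limit quoted in the error message: max(nrow(x), nrow(i))
limit1 <- max(nrow(x), length(keys1))
limit2 <- max(nrow(x), length(keys2))

n1 > limit1  # FALSE: 4 rows do not exceed the limit of 4
n2 > limit2  # TRUE: 5 rows exceed the limit of 4

# The join itself, with the check switched off, returns the same row count.
n2_actual <- nrow(x[J(keys2), allow.cartesian = TRUE])
```

So the only difference between the y1 and y2 lookups is that the extra "b" pushes the result from 4 rows to 5, one more than the limit of 4.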

Here's an (exaggerated) example that illustrates the idea behind this error message:

require(data.table)
DT1 <- data.table(x = rep(letters[1:2], c(1e2, 1e7)),
                  y = 1L, key = "x")
DT2 <- data.table(x = rep("b", 3), key = "x")

# not run
# DT1[DT2] ## error

dim(DT1[DT2, allow.cartesian = TRUE])
# [1] 30000000        2

The duplicates in DT2 resulted in 3 times the total number of "b" rows in DT1 (3 * 1e7 = 3e7 rows). Imagine performing the join with 1e4 such values in DT2: the result would explode! To guard against this, there's the allow.cartesian argument, which is FALSE by default.
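The same arithmetic can be checked on a scaled-down version of the example that runs instantly. This is only a sketch with made-up sizes (2 "a" rows and 5 "b" rows instead of 1e2 and 1e7); the names small1 and small2 are illustrative and do not appear in the example above.

```r
library(data.table)

# Scaled-down stand-ins for DT1 and DT2 (sizes chosen for illustration).
small1 <- data.table(x = rep(letters[1:2], c(2, 5)), y = 1L, key = "x")
small2 <- data.table(x = rep("b", 3), key = "x")

# Each of the 3 "b" rows in small2 matches all 5 "b" rows in small1.
expected_rows <- nrow(small2) * sum(small1$x == "b")

res <- small1[small2, allow.cartesian = TRUE]
nrow(res) == expected_rows  # TRUE: 3 * 5 = 15 rows

# If the repeated lookups are not intended, de-duplicating i first
# keeps the result at a single copy of the matching group.
dedup_rows <- nrow(small1[unique(small2)])  # 5 rows
```

The result grows with the product of the duplicate counts on both sides, which is why a modest number of repeated values in i can blow up a join against a large group in x.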

That being said, I think Matt once mentioned that it may be possible to raise the error only for "large" joins (that is, joins that result in a huge number of rows, with the cutoff set somewhat arbitrarily, I guess). This, when/if implemented, would let joins that don't combinatorially explode run without this error message.

Arun answered Sep 25 '22 15:09