I am new to R and to data.table
, which I find useful and fast. I am trying to join 2 data tables:
> TotFreq
Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType
1: 113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM
2: 113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM
3: 113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM
4: 113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM
5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM
160167: 113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM
160168: 113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM
160169: 113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM
160170: 113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM
160171: 113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM
> Count_SM_FM
Legacy_Store_Number WeekDay ItemType ObjItems
1: 113802 1 SM 12305
2: 113802 1 FM 1942
3: 113802 2 SM 11014
4: 113802 2 FM 1398
5: 113802 3 SM 10154
6: 113802 3 FM 1117
7: 113802 4 SM 10414
8: 113802 4 FM 1167
9: 113802 5 SM 10258
10: 113802 5 FM 1200
11: 113802 6 SM 11116
12: 113802 6 FM 1575
13: 113802 7 SM 13098
14: 113802 7 FM 2326
> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
>
> ResultJoin <- TotFreq[Count_SM_FM]
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 320342 rows; more than 160171 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
But I do not have duplicate keys in i
!
Using:
> ResultJoin <- TotFreq[Count_SM_FM,allow.cartesian=T]
>
> ResultJoin
Legacy_Store_Number WeekDay Date Item_Key Distr NoSellingDays meanUnits ItemType ItemType.1 ObjItems
1: 113802 1 2013-03-24 000000000120 2.428985e-04 0 8.00 FM SM 12305
2: 113802 1 2013-03-24 000000000126 1.104030e-03 0 47.50 FM SM 12305
3: 113802 1 2013-03-24 000000000170 1.126004e-03 0 48.75 FM SM 12305
4: 113802 1 2013-03-24 000000000180 5.143034e-04 0 19.00 FM SM 12305
5: 113802 1 2013-03-24 000000000260 3.854306e-04 0 12.25 FM SM 12305
---
320338: 113802 7 2013-03-23 978125002327 5.902655e-07 27 1.00 SM FM 2326
320339: 113802 7 2013-03-23 978141970584 1.770796e-06 25 1.00 SM FM 2326
320340: 113802 7 2013-03-23 978145300697 1.180531e-06 26 1.00 SM FM 2326
320341: 113802 7 2013-03-23 978145552558 5.902655e-07 27 1.00 SM FM 2326
320342: 113802 7 2013-03-23 978160139536 5.902655e-07 27 1.00 SM FM 2326
I get in fact double the records that I had in my original TotFreq
table. If I add a key also on Count_SM_FM
the join works:
> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
> setkey(Count_SM_FM,Legacy_Store_Number,WeekDay,ItemType)
> ResultJoin <- TotFreq[Count_SM_FM]
>
> ResultJoin
Legacy_Store_Number WeekDay ItemType Date Item_Key Distr NoSellingDays meanUnits ObjItems
1: 113802 1 FM 2013-03-24 000000000120 2.428985e-04 0 8.00 1942
2: 113802 1 FM 2013-03-24 000000000126 1.104030e-03 0 47.50 1942
3: 113802 1 FM 2013-03-24 000000000170 1.126004e-03 0 48.75 1942
4: 113802 1 FM 2013-03-24 000000000180 5.143034e-04 0 19.00 1942
5: 113802 1 FM 2013-03-24 000000000260 3.854306e-04 0 12.25 1942
---
160167: 113802 7 SM 2013-03-23 978125002327 5.902655e-07 27 1.00 13098
160168: 113802 7 SM 2013-03-23 978141970584 1.770796e-06 25 1.00 13098
160169: 113802 7 SM 2013-03-23 978145300697 1.180531e-06 26 1.00 13098
160170: 113802 7 SM 2013-03-23 978145552558 5.902655e-07 27 1.00 13098
160171: 113802 7 SM 2013-03-23 978160139536 5.902655e-07 27 1.00 13098
I tried to verify with an example, maybe the problem was in not having the key variables as the first columns in TotFreq
or in not having Count_SM_FM
not sorted but I couldn't reproduce the error
> daysType <- data.table(
+ key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+ key2=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7),
+ key3=c("b","a","a","b","a","b","a","b","a","b","a","b","a","b"),
+ var1=c(2,4,6,8,4,5,7,3,7,9,6,3,5,6)
+ )
>
>
> detailData <- data.table(
+ key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+ key2=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,7,7),
+ var2=c(10,11,12,13,15,16,17,10,11,12,13,14,15,16,10,11,12,15,16,17,10,11,12,13,14,15,16,17,10,11,13,14,15,16,17,10,11,12,13,14,15,10,11,12,13,14,15,16,17),
+ var3=c(1,2,4,6,6,7,3,6,8,9,3,5,7,8,6,7,8,6,7,2,4,6,7,8,2,3,5,7,4,7,8,3,6,4,2,5,7,3,6,7,3,4,2,4,6,4,7,2,9),
+ key3=c("a","a","a","a","b","b","b","a","a","a","a","b","b","b","a","a","a","b","b","b","a","a","a","a","b","b","b","b","a","a","a","b","b","b","b","a","a","a","a","b","b","a","a","a","a","b","b","b","b")
+ )
>
> setkey(detailData,key1,key2,key3)
> JoinResult <- detailData[daysType]
The problem is different than that in question
Join of two data.tables fails
because there the allow.cartesian
solves the issue.
What is the problem here? Why adding the key to Count_SM_FM
solves it?
Thanks!
Update Oct 2014: Arun fixed it in v1.9.5 :
allow.cartesian
is now ignored wheni
has no duplicates, #742 and #508. Thanks to @nigmastar, @user3645882 and others for the reports.
Previous answer ...
First let's address the allow.cartesian
part. The error message should probably be changed to point out that you can get large sizes even if you don't have duplicates in i
, but you have duplicates in the left hand side data.table
. Here's a simple example:
dt1 = data.table(a = c(1,1), b = 1:2, key = 'a')
dt2 = data.table(a = c(1,2), c = 3:4)
dt1[dt2] # this gives an error, because join results in 3 rows, as seen below
dt1[dt2, allow.cartesian = TRUE]
# a b c
#1: 1 1 3
#2: 1 2 3
#3: 2 NA 4
Now as far as setting the key goes - no you don't need to set the key for i
, it will just assume the first few columns are the keys. Looking at your first join result one can see that it was not joined on ItemType
and that you're using an older data.table
version (I'm using 1.9.3). So my guess is that either you didn't actually set the key correctly and didn't include ItemType
or there was some bug in older versions that's been fixed since then.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With