Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

data.table join (Error in vecseq) is key necessary on both on X and i?

Tags:

join

r

data.table

I am new to R and to data.table, which I find useful and fast. I am trying to join 2 data tables:

> TotFreq
        Legacy_Store_Number WeekDay       Date     Item_Key        Distr NoSellingDays meanUnits ItemType
     1:              113802       1 2013-03-24 000000000120 2.428985e-04             0      8.00       FM
     2:              113802       1 2013-03-24 000000000126 1.104030e-03             0     47.50       FM
     3:              113802       1 2013-03-24 000000000170 1.126004e-03             0     48.75       FM
     4:              113802       1 2013-03-24 000000000180 5.143034e-04             0     19.00       FM
     5:              113802       1 2013-03-24 000000000260 3.854306e-04             0     12.25       FM
160167:              113802       7 2013-03-23 978125002327 5.902655e-07            27      1.00       SM
160168:              113802       7 2013-03-23 978141970584 1.770796e-06            25      1.00       SM
160169:              113802       7 2013-03-23 978145300697 1.180531e-06            26      1.00       SM
160170:              113802       7 2013-03-23 978145552558 5.902655e-07            27      1.00       SM
160171:              113802       7 2013-03-23 978160139536 5.902655e-07            27      1.00       SM

> Count_SM_FM
    Legacy_Store_Number WeekDay ItemType ObjItems
 1:              113802       1       SM    12305
 2:              113802       1       FM     1942
 3:              113802       2       SM    11014
 4:              113802       2       FM     1398
 5:              113802       3       SM    10154
 6:              113802       3       FM     1117
 7:              113802       4       SM    10414
 8:              113802       4       FM     1167
 9:              113802       5       SM    10258
10:              113802       5       FM     1200
11:              113802       6       SM    11116
12:              113802       6       FM     1575
13:              113802       7       SM    13098
14:              113802       7       FM     2326
> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
> 
> ResultJoin <- TotFreq[Count_SM_FM]
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  : 
  Join results in 320342 rows; more than 160171 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

But I do not have duplicate keys in i!

Using:

> ResultJoin <- TotFreq[Count_SM_FM,allow.cartesian=T]
> 
> ResultJoin
        Legacy_Store_Number WeekDay       Date     Item_Key        Distr NoSellingDays meanUnits ItemType ItemType.1 ObjItems
     1:              113802       1 2013-03-24 000000000120 2.428985e-04             0      8.00       FM         SM    12305
     2:              113802       1 2013-03-24 000000000126 1.104030e-03             0     47.50       FM         SM    12305
     3:              113802       1 2013-03-24 000000000170 1.126004e-03             0     48.75       FM         SM    12305
     4:              113802       1 2013-03-24 000000000180 5.143034e-04             0     19.00       FM         SM    12305
     5:              113802       1 2013-03-24 000000000260 3.854306e-04             0     12.25       FM         SM    12305
    ---                                                                                                                      
320338:              113802       7 2013-03-23 978125002327 5.902655e-07            27      1.00       SM         FM     2326
320339:              113802       7 2013-03-23 978141970584 1.770796e-06            25      1.00       SM         FM     2326
320340:              113802       7 2013-03-23 978145300697 1.180531e-06            26      1.00       SM         FM     2326
320341:              113802       7 2013-03-23 978145552558 5.902655e-07            27      1.00       SM         FM     2326
320342:              113802       7 2013-03-23 978160139536 5.902655e-07            27      1.00       SM         FM     2326

I get in fact double the records that I had in my original TotFreq table. If I add a key also on Count_SM_FM the join works:

> setkey(TotFreq,Legacy_Store_Number,WeekDay,ItemType)
> setkey(Count_SM_FM,Legacy_Store_Number,WeekDay,ItemType)
> ResultJoin <- TotFreq[Count_SM_FM]
> 
> ResultJoin
        Legacy_Store_Number WeekDay ItemType       Date     Item_Key        Distr NoSellingDays meanUnits ObjItems
     1:              113802       1       FM 2013-03-24 000000000120 2.428985e-04             0      8.00     1942
     2:              113802       1       FM 2013-03-24 000000000126 1.104030e-03             0     47.50     1942
     3:              113802       1       FM 2013-03-24 000000000170 1.126004e-03             0     48.75     1942
     4:              113802       1       FM 2013-03-24 000000000180 5.143034e-04             0     19.00     1942
     5:              113802       1       FM 2013-03-24 000000000260 3.854306e-04             0     12.25     1942
    ---                                                                                                           
160167:              113802       7       SM 2013-03-23 978125002327 5.902655e-07            27      1.00    13098
160168:              113802       7       SM 2013-03-23 978141970584 1.770796e-06            25      1.00    13098
160169:              113802       7       SM 2013-03-23 978145300697 1.180531e-06            26      1.00    13098
160170:              113802       7       SM 2013-03-23 978145552558 5.902655e-07            27      1.00    13098
160171:              113802       7       SM 2013-03-23 978160139536 5.902655e-07            27      1.00    13098

I tried to verify with an example, maybe the problem was in not having the key variables as the first columns in TotFreq or in not having Count_SM_FM not sorted but I couldn't reproduce the error

> daysType <- data.table(
+     key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+     key2=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7),
+     key3=c("b","a","a","b","a","b","a","b","a","b","a","b","a","b"),
+     var1=c(2,4,6,8,4,5,7,3,7,9,6,3,5,6)
+ )        
> 
> 
> detailData <- data.table(
+     key1=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
+     key2=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,7,7),
+     var2=c(10,11,12,13,15,16,17,10,11,12,13,14,15,16,10,11,12,15,16,17,10,11,12,13,14,15,16,17,10,11,13,14,15,16,17,10,11,12,13,14,15,10,11,12,13,14,15,16,17),
+     var3=c(1,2,4,6,6,7,3,6,8,9,3,5,7,8,6,7,8,6,7,2,4,6,7,8,2,3,5,7,4,7,8,3,6,4,2,5,7,3,6,7,3,4,2,4,6,4,7,2,9),
+     key3=c("a","a","a","a","b","b","b","a","a","a","a","b","b","b","a","a","a","b","b","b","a","a","a","a","b","b","b","b","a","a","a","b","b","b","b","a","a","a","a","b","b","a","a","a","a","b","b","b","b")
+ )        
> 
> setkey(detailData,key1,key2,key3)
> JoinResult <- detailData[daysType]

The problem is different than that in question

Join of two data.tables fails

because there the allow.cartesian solves the issue.

What is the problem here? Why adding the key to Count_SM_FM solves it?

Thanks!

like image 217
user3645882 Avatar asked May 22 '14 14:05

user3645882


Video Answer


1 Answers

Update Oct 2014: Arun fixed it in v1.9.5 :

allow.cartesian is now ignored when i has no duplicates, #742 and #508. Thanks to @nigmastar, @user3645882 and others for the reports.



Previous answer ...

First let's address the allow.cartesian part. The error message should probably be changed to point out that you can get large sizes even if you don't have duplicates in i, but you have duplicates in the left hand side data.table. Here's a simple example:

dt1 = data.table(a = c(1,1), b = 1:2, key = 'a')
dt2 = data.table(a = c(1,2), c = 3:4)

dt1[dt2] # this gives an error, because join results in 3 rows, as seen below

dt1[dt2, allow.cartesian = TRUE]
#   a  b c
#1: 1  1 3
#2: 1  2 3
#3: 2 NA 4

Now as far as setting the key goes - no you don't need to set the key for i, it will just assume the first few columns are the keys. Looking at your first join result one can see that it was not joined on ItemType and that you're using an older data.table version (I'm using 1.9.3). So my guess is that either you didn't actually set the key correctly and didn't include ItemType or there was some bug in older versions that's been fixed since then.

like image 139
eddi Avatar answered Oct 17 '22 01:10

eddi