
Join results in more than 2^31 rows (internal vecseq reached physical limit)

I just tried merging two tables in R 3.0.1 on a machine with 64GB of RAM and got the following error. Help would be appreciated. (The data.table version is 1.8.8.)

Here is what my code looks like:

library(parallel)
library(data.table)

data1: several million rows and 3 columns. The columns are tag, prod and v. There are 750K unique values of tag, anywhere from 1 to 1,000 prods per tag, and 5,000 possible values for prod. v takes any positive real value.

setkey(data1, tag)
merge(data1, data1, allow.cartesian=TRUE)

I get the following error:

Error in vecseq(f_, len_, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice. Calls: merge -> merge.data.table -> [ -> [.data.table -> vecseq
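
A rough way to see the problem before calling merge is to estimate the size of the keyed self-join: each tag that appears n times contributes n^2 rows, so the result has sum(n^2) rows over all tags, which can easily pass 2^31 (about 2.15 billion) when some tags have up to 1,000 rows. A minimal sketch, using synthetic data that only mimics the shape described above (the sizes here are assumptions, not the real data1):

library(data.table)
set.seed(1)
# synthetic stand-in for data1 (assumed shape: tag, prod, v)
data1 <- data.table(tag  = sample(750000L, 5e6, replace = TRUE),
                    prod = sample(5000L,   5e6, replace = TRUE),
                    v    = runif(5e6))
setkey(data1, tag)

# each tag with n rows contributes n^2 rows to the keyed self-join
est <- data1[, .N, by = tag][, sum(as.numeric(N)^2)]
est            # compare against 2^31 (~2.15e9) before attempting the merge
est > 2^31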

New example showing by-without-by:

country = fread("
country product share
1 5 .2
1 6 .2
1 7 .6
2 6 .3
2 7 .1
2 8 .4
2 9 .2
")
prod = fread("
prod period value
5 1990 2
5 1991 3
5 1992 2
5 1993 4
5 1994 3
5 1995 5
6 1990 1
6 1991 1
6 1992 0
6 1993 4
6 1994 8
6 1995 2
7 1990 3
7 1991 3
7 1992 3
7 1993 4
7 1994 7
7 1995 1
8 1990 2
8 1991 4
8 1992 2
8 1993 4
8 1994 2
8 1995 6
9 1990 1
9 1991 2
9 1992 4
9 1993 4
9 1994 5
9 1995 6
")

It seems entirely impossible to select the subset of markets that share a country tag, find the covariances within those pairs, and collate those by country without running up against the size limit. Here is my best shot so far:

setkey(country,country)
setkey(prod, prod, period)
# Really long one-liner that finds unique market pairs from the self-join,
# merges them with the second table, and calculates covariances from the
# merged table:
covars <- setkey(
            setkey(unique(country[country, allow.cartesian=T][, c("prod","prod.1"), with=F]), prod)[prod, allow.cartesian=T],
            prod.1, period
          )[prod, ][, list(pcov = cov(value, value.1)), by=list(prod, prod.1)]

# Collate by country, weighting each pair's covariance by the two market shares:
clevel <- setkey(country[country, allow.cartesian=T], prod, prod.1)[
            covars, nomatch=0][, list(countryvar = sum(share*share.1*pcov)), by="country"]
> clevel
   country countryvar
1:       1   2.858667
2:       2   1.869667

When I try this approach for any reasonable size of data, I run up against the vecseq error. It would be really nice if data.table did not balk so much at the 2^31 limit. I am a fan of the package. Suggestions on how I can use more of the j specification would also be appreciated. (I am not sure how else to try the j specification given how I have to compute variances from the intersection of the two data tables.)

asked Aug 07 '13 by user2627717

2 Answers

R 3.0.1 supports objects having lengths greater than 2^31 - 1. While the packages that come with base R can already create such objects, whether contributed packages can do the same depends on the package. Basically, any package using compiled code would have to be recompiled and possibly modified to take advantage of this feature.

Also, assuming that 64GB RAM is enough to work with 60GB objects is kind of optimistic.
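
A minimal check (my addition, not from the answer) that a given R build really does allow long vectors; it allocates a 2 GB raw vector, so only run it with enough free memory:

# needs R >= 3.0.0; on older builds this allocation fails
x <- raw(2^31)
length(x)                         # 2147483648, past the old 2^31 - 1 limit
length(x) > .Machine$integer.max  # TRUE
rm(x); gc()                       # release the 2 GB straight away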

answered by Hong Ooi

This join indeed seems to be misspecified. In general, I think, a self join of a table with a single-column key is probably always misspecified. Consider the following example:

> DT
   A B
1: 1 5
2: 1 6
3: 1 7
4: 2 8
5: 2 9
> setkey(DT,A)

There are 2 unique values of A (1 and 2), but they are repeated in the A column. The key is a single column.

> DT[DT]   # the long error message

> DT[DT, allow.cartesian=TRUE]  # **each row** of DT is self joined to DT
    A B B.1
 1: 1 5   5
 2: 1 6   5
 3: 1 7   5
 4: 1 5   6
 5: 1 6   6
 6: 1 7   6
 7: 1 5   7
 8: 1 6   7
 9: 1 7   7
10: 2 8   8
11: 2 9   8
12: 2 8   9
13: 2 9   9

Is this really the result you need? More likely, the query needs to be changed: add more columns to the key (see the sketch just below), use by instead, avoid the self join, or follow the hints in the error message.
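
A minimal sketch of the first suggestion, assuming the same DT as above: with both columns in the key, each key value is unique, so the self join is 1:1 and stays at 5 rows.

setkey(DT, A, B)
DT[DT]     # 5 rows back, no cartesian blow-up
#    A B
# 1: 1 5
# 2: 1 6
# 3: 1 7
# 4: 2 8
# 5: 2 9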

More information about what you need to achieve after the merge (bigger picture) is likely to help.

Example of "including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation" (see the error message in the question):

The example now in the question (covariance) is normally done with matrices rather than data.table. But proceeding anyway to demonstrate quickly, just using it as example data ...

require(data.table)
country = fread("
Country Product
1 5
1 6
1 7
2 6
2 7
2 8
2 9
")
prod = fread("
Prod1 Prod2 Covariance
5 5 .4
5 6 .5
5 7 .6
5 8 -.3
5 9 -.1
6 6 .3
6 7 .2
6 8 .4
6 9 -.2
7 7 .2
7 8 .1
7 9 .3
8 8 .1
8 9 .6
9 9 .01
")


country
   Country Product
1:       1       5
2:       1       6
3:       1       7
4:       2       6
5:       2       7
6:       2       8
7:       2       9
prod
    Prod1 Prod2 Covariance
 1:     5     5       0.40
 2:     5     6       0.50
 3:     5     7       0.60
 4:     5     8      -0.30
 5:     5     9      -0.10
 6:     6     6       0.30
 7:     6     7       0.20
 8:     6     8       0.40
 9:     6     9      -0.20
10:     7     7       0.20
11:     7     8       0.10
12:     7     9       0.30
13:     8     8       0.10
14:     8     9       0.60
15:     9     9       0.01


setkey(country,Country)
country[country,{print(.SD);print(i.Product)}]
# included j to demonstrate j running for each row of i. Just printing to demo.
   Product
1:       5
2:       6
3:       7
[1] 5
   Product
1:       5
2:       6
3:       7
[1] 6
   Product
1:       5
2:       6
3:       7
[1] 7
   Product
1:       6
2:       7
3:       8
4:       9
[1] 6
   Product
1:       6
2:       7
3:       8
4:       9
[1] 7
   Product
1:       6
2:       7
3:       8
4:       9
[1] 8
   Product
1:       6
2:       7
3:       8
4:       9
[1] 9
Empty data.table (0 rows) of 2 cols: Country,Product


setkey(prod,Prod1,Prod2)
country[country,prod[J(i.Product,Product),Covariance,nomatch=0]]
    Country Prod1 Prod2 Covariance
 1:       1     5     5       0.40
 2:       1     5     6       0.50
 3:       1     5     7       0.60
 4:       1     6     6       0.30
 5:       1     6     7       0.20
 6:       1     7     7       0.20
 7:       2     6     6       0.30
 8:       2     6     7       0.20
 9:       2     6     8       0.40
10:       2     6     9      -0.20
11:       2     7     7       0.20
12:       2     7     8       0.10
13:       2     7     9       0.30
14:       2     8     8       0.10
15:       2     8     9       0.60
16:       2     9     9       0.01

country[country, prod[J(i.Product,Product),Covariance,nomatch=0]][
    ,mean(Covariance),by=Country]
   Country        V1
1:       1 0.3666667
2:       2 0.2010000

This doesn't match the desired result because the off-diagonal terms aren't doubled. But hopefully this is enough to demonstrate that particular suggestion from the error message in the question, and you can take it from here. Or use matrix rather than data.table for covariance-type work.
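
As a hedged follow-up (my addition, not part of the original answer): one way to double the off-diagonal terms is to reuse the 16-row pairs table built above and weight mixed pairs twice before summing. The asker's full countryvar would additionally need the share weights, which this example data does not contain.

# pairs: the Country/Prod1/Prod2/Covariance table shown above
pairs <- country[country, prod[J(i.Product, Product), Covariance, nomatch = 0]]
# off-diagonal pairs (Prod1 != Prod2) appear once here, so give them weight 2
pairs[, sum(Covariance * (1L + (Prod1 != Prod2))), by = Country]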

answered by Matt Dowle