I just tried merging two tables in R 3.0.1 on a machine with 64 GB of RAM and got the following error. Help would be appreciated. (The data.table version is 1.8.8.)
Here is what my code looks like:
library(parallel)
library(data.table)
data1: several million rows and 3 columns. The columns are tag, prod and v. There are 750K unique values of tag, anywhere from 1 to 1000 prods per tag, and 5000 possible values for prod. v takes any positive real value.
setkey(data1, tag)
merge(data1, data1, allow.cartesian=TRUE)
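For data shaped like this, the blow-up can be predicted before running the join: a keyed self join on tag alone returns the sum over tags of n_tag^2 rows. A minimal sketch with toy data (the real data1 would use the same expression on its actual columns):

```r
library(data.table)

# Toy stand-in for data1: three tags with 2, 3 and 4 prods respectively
set.seed(1)
data1 <- data.table(tag  = rep(1:3, times = c(2, 3, 4)),
                    prod = c(1:2, 1:3, 1:4),
                    v    = runif(9))
setkey(data1, tag)

# A keyed self join on tag alone yields sum over tags of n_tag^2 rows;
# this is the quantity that overflows 2^31 at the question's scale
expected <- data1[, .N, by = tag][, sum(as.numeric(N)^2)]
expected   # 4 + 9 + 16 = 29
```

At 750K tags with up to 1000 prods each, that sum can easily exceed 2^31, which is exactly what the error reports.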
I get the following error:
Error in vecseq(f_, len_, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice. Calls: merge -> merge.data.table -> [ -> [.data.table -> vecseq
country = fread("
country prod share
1 5 .2
1 6 .2
1 7 .6
2 6 .3
2 7 .1
2 8 .4
2 9 .2
")
prod = fread("
prod period value
5 1990 2
5 1991 3
5 1992 2
5 1993 4
5 1994 3
5 1995 5
6 1990 1
6 1991 1
6 1992 0
6 1993 4
6 1994 8
6 1995 2
7 1990 3
7 1991 3
7 1992 3
7 1993 4
7 1994 7
7 1995 1
8 1990 2
8 1991 4
8 1992 2
8 1993 4
8 1994 2
8 1995 6
9 1990 1
9 1991 2
9 1992 4
9 1993 4
9 1994 5
9 1995 6
")
It seems entirely impossible to select the subset of markets that share a country tag, find the covariances within those pairs, and collate those by country without running up against the size limit. Here is my best shot so far:
setkey(country,country)
setkey(prod, prod, period)
covars <- setkey(setkey(unique(country[country, allow.cartesian=T][, c("prod","prod.1"), with=F]),prod)[prod, allow.cartesian=T], prod.1, period)[prod, ] [ , list(pcov = cov(value,value.1)), by=list(prod,prod.1)] # really long one-liner that finds unique market pairs from the self-join, merges them with the second table, and calculates covariances from the merged table
clevel <-setkey(country[country, allow.cartesian = T], prod, prod.1)[covars, nomatch=0][ , list(countryvar = sum(share*share.1*pcov)), by="country"]
> clevel
country countryvar
1: 1 2.858667
2: 2 1.869667
When I try this approach for any reasonable size of data, I run up against the vecseq error. It would be really nice if data.table did not balk so much at the 2^31 limit. I am a fan of the package. Suggestions on how I can use more of the j specification would also be appreciated. (I am not sure how else to use the j specification, given that I have to compute variances from the intersection of the two data tables.)
R 3.0.1 supports objects having lengths greater than 2^31 - 1. While the packages that come with base R can already create such objects, whether contributed packages can do the same depends on the package. Basically, any package using compiled code would have to be recompiled and possibly modified to take advantage of this feature.
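Note that even with long-vector support, R's integer type itself is still 32-bit; lengths above 2^31 - 1 are handled by indexing with doubles. A quick check:

```r
# R integers remain 32-bit even in R >= 3.0.0; long vectors are
# indexed with doubles instead of integers
.Machine$integer.max          # 2147483647, i.e. 2^31 - 1
2^31 > .Machine$integer.max   # TRUE: lengths past this need long vectors
```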
Also, assuming that 64GB RAM is enough to work with 60GB objects is kind of optimistic.
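To put rough numbers on that (my arithmetic, not from the question): three double columns at the 2^31-row limit already occupy about 48 GiB, before any intermediate copies that the join and sort steps need:

```r
# 2^31 rows x 3 numeric columns x 8 bytes each, expressed in GiB
rows  <- 2^31
cols  <- 3
bytes <- rows * cols * 8
bytes / 2^30   # 48 GiB, and merge/sort steps need working copies on top
```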
This join indeed seems to be misspecified. In general, I think, a self join of a table with a single-column key is probably always misspecified. Consider the following example:
> DT = data.table(A=c(1,1,1,2,2), B=5:9)
> DT
A B
1: 1 5
2: 1 6
3: 1 7
4: 2 8
5: 2 9
> setkey(DT,A)
There are 2 unique values of A (1 and 2), but they are repeated in the A column. The key is a single column.
> DT[DT] # the long error message
> DT[DT, allow.cartesian=TRUE] # **each row** of DT is self joined to DT
A B B.1
1: 1 5 5
2: 1 6 5
3: 1 7 5
4: 1 5 6
5: 1 6 6
6: 1 7 6
7: 1 5 7
8: 1 6 7
9: 1 7 7
10: 2 8 8
11: 2 9 8
12: 2 8 9
13: 2 9 9
Is this really the result you need? More likely, the query needs to be changed by adding more columns to the key, doing a by instead, not doing a self join, or following the hints in the error message.
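As a sketch of the "doing a by instead" option (assuming the within-group pairs are really what's wanted): with by, j runs once per group, so each group's cross product is built separately instead of as one huge allocation:

```r
library(data.table)

DT <- data.table(A = c(1, 1, 1, 2, 2), B = 5:9)

# Instead of DT[DT, allow.cartesian=TRUE], form the within-group pairs
# with CJ inside j; j runs once per group of A, so no single
# 2^31-row result ever has to be materialised at once
pairs <- DT[, CJ(B1 = B, B2 = B), by = A]
pairs   # 9 rows for A==1 plus 4 rows for A==2, i.e. 13 rows
```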
More information about what you need to achieve after the merge (bigger picture) is likely to help.
The covariance example now in the question is normally done with matrices rather than data.table. But proceeding anyway to quickly demonstrate, just using it as example data ...
require(data.table)
country = fread("
Country Product
1 5
1 6
1 7
2 6
2 7
2 8
2 9
")
prod = fread("
Prod1 Prod2 Covariance
5 5 .4
5 6 .5
5 7 .6
5 8 -.3
5 9 -.1
6 6 .3
6 7 .2
6 8 .4
6 9 -.2
7 7 .2
7 8 .1
7 9 .3
8 8 .1
8 9 .6
9 9 .01
")
country
Country Product
1: 1 5
2: 1 6
3: 1 7
4: 2 6
5: 2 7
6: 2 8
7: 2 9
prod
Prod1 Prod2 Covariance
1: 5 5 0.40
2: 5 6 0.50
3: 5 7 0.60
4: 5 8 -0.30
5: 5 9 -0.10
6: 6 6 0.30
7: 6 7 0.20
8: 6 8 0.40
9: 6 9 -0.20
10: 7 7 0.20
11: 7 8 0.10
12: 7 9 0.30
13: 8 8 0.10
14: 8 9 0.60
15: 9 9 0.01
setkey(country,Country)
country[country,{print(.SD);print(i.Product)}]
# included j to demonstrate j running for each row of i. Just printing to demo.
Product
1: 5
2: 6
3: 7
[1] 5
Product
1: 5
2: 6
3: 7
[1] 6
Product
1: 5
2: 6
3: 7
[1] 7
Product
1: 6
2: 7
3: 8
4: 9
[1] 6
Product
1: 6
2: 7
3: 8
4: 9
[1] 7
Product
1: 6
2: 7
3: 8
4: 9
[1] 8
Product
1: 6
2: 7
3: 8
4: 9
[1] 9
Empty data.table (0 rows) of 2 cols: Country,Product
setkey(prod,Prod1,Prod2)
country[country,prod[J(i.Product,Product),Covariance,nomatch=0]]
Country Prod1 Prod2 Covariance
1: 1 5 5 0.40
2: 1 5 6 0.50
3: 1 5 7 0.60
4: 1 6 6 0.30
5: 1 6 7 0.20
6: 1 7 7 0.20
7: 2 6 6 0.30
8: 2 6 7 0.20
9: 2 6 8 0.40
10: 2 6 9 -0.20
11: 2 7 7 0.20
12: 2 7 8 0.10
13: 2 7 9 0.30
14: 2 8 8 0.10
15: 2 8 9 0.60
16: 2 9 9 0.01
country[country, prod[J(i.Product,Product),Covariance,nomatch=0]][
    ,mean(Covariance),by=Country]
Country V1
1: 1 0.3666667
2: 2 0.2010000
This doesn't match the desired result because the off-diagonal terms are not doubled. But hopefully this is enough to demonstrate that particular suggestion in the error message in the question, and you can take it from here. Or use matrix rather than data.table for covariance-type work.
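A hedged sketch of how that doubling could be finished off (the Share column and its values are my invention for illustration; the answer's tables carry no shares). Since prod stores each off-diagonal covariance only once, those pairs get weight 2 in the country-level sum of s_i * s_j * cov_ij:

```r
library(data.table)

# Hypothetical shares for Country 1, plus the matching slice of prod
country <- data.table(Country = 1L, Product = c(5L, 6L, 7L),
                      Share   = c(.2, .2, .6))
prod    <- data.table(Prod1 = c(5L, 5L, 5L, 6L, 6L, 7L),
                      Prod2 = c(5L, 6L, 7L, 6L, 7L, 7L),
                      Covariance = c(.4, .5, .6, .3, .2, .2))

# Within-country product pairs, formed per group; keep the upper triangle
pairs <- country[, CJ(Prod1 = Product, Prod2 = Product), by = Country][Prod1 <= Prod2]
pairs <- merge(pairs, prod, by = c("Prod1", "Prod2"))

# Attach the share for each leg of the pair
pairs <- merge(pairs, country[, .(Country, Prod1 = Product, s1 = Share)],
               by = c("Country", "Prod1"))
pairs <- merge(pairs, country[, .(Country, Prod2 = Product, s2 = Share)],
               by = c("Country", "Prod2"))

# Off-diagonal covariances appear once in prod, so weight them by 2
clevel <- pairs[, .(countryvar = sum(ifelse(Prod1 == Prod2, 1, 2) *
                                     s1 * s2 * Covariance)),
                by = Country]
```

With these made-up shares, Country 1's variance comes out to 0.332; the same pattern scales to many countries because every step runs per group.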