I have two data.tables: DT
and meta
. When I merge them using DT[meta]
, memory usage increases by more than 10 GB (and the merge is very slow). What's going wrong? It seems like the merge is successful, but I can only look at single lines, otherwise I run out of memory. DT
itself was created by merging two data.tables without any problems.
Edit:
It seems to be a problem with the key. I can do the following without a problem:
DT[,id:=1:nrow(DT)]
meta[,id:=1:nrow(DT)]
setkey(DT,id)
setkey(meta,id)
DT2<-DT[meta] # Comment from Matthew Dowle:
# X[Y] (or merge) on a key of 1:nrow(DT) is just a cbind, isn't it?
unique(DT2[,"Moor_ID",with=F]==DT2[,"Moor_ID.1",with=F])
Moor_ID
[1,] TRUE
First data.table:
str(DT)
Classes ‘data.table’ and 'data.frame': 10212 obs. of 55 variables:
$ DWD_ID : chr "Bremerhav" "Bremerhav" "Bremerhav" "Bremerhav" ...
$ numdays : int 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 ...
$ days : Date, format: "2009-09-01" "2009-09-02" "2009-09-03" "2009-09-04" ...
$ TBoden_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ TBoden_dayMin : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMax : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanAR : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayStabw : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMedian : num NA NA NA NA NA NA NA NA NA NA ...
$ TBoden_dayMeanMM : num NA NA NA NA NA NA NA NA NA NA ...
$ T2m_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ T2m_dayMin : num 15.6 13.8 13.7 12.8 13.5 13.1 13.3 13.8 15.9 13.7 ...
$ T2m_dayMax : num 25.6 19.9 18.1 18.1 16.9 18.6 21 25.7 19.3 17.6 ...
$ T2m_dayMeanAR : num 19 16.9 15.6 15.2 14.8 ...
$ T2m_dayStabw : num 3.409 2.048 1.334 1.726 0.965 ...
$ T2m_dayMedian : num 17.2 16.8 15.2 14.8 14.5 ...
$ T2m_dayMeanMM : num 20.6 16.9 15.9 15.4 15.2 ...
$ T10cm_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ T10cm_dayMin : num 14.3 12.6 12.9 12.2 12.7 12 12.8 11.7 15.1 12.2 ...
$ T10cm_dayMax : num 27.7 20.9 18.7 18.7 17.4 19.8 22.4 25.9 21.8 18.6 ...
$ T10cm_dayMeanAR : num 18.7 16.5 14.9 15.1 14.5 ...
$ T10cm_dayStabw : num 4.36 2.84 1.73 2.36 1.54 ...
$ T10cm_dayMedian : num 16.1 15.6 14.3 14.2 14 ...
$ T10cm_dayMeanMM : num 21 16.8 15.8 15.4 15.1 ...
$ RF_dayAnzahl : int 0 0 0 0 0 0 0 0 0 0 ...
$ RF_dayMin : num 45 58 73 56 68 62 63 44 65 58 ...
$ RF_dayMax : num 94 94 94 93 94 92 84 84 89 84 ...
$ RF_dayMean : num 68.6 76.3 78.9 74.4 86.5 ...
$ RF_dayStabw : num 17.09 12.53 5.88 9.83 5.62 ...
$ RF_dayMedian : num 64.5 74 77.5 76 87.5 77.5 75 63 77 76 ...
$ Luftdruck_dayMean : num 100.8 101 99.7 99.9 101.1 ...
$ es_day : num 2.53 1.95 1.82 1.78 1.74 ...
$ ea_day : num 1.57 1.42 1.49 1.27 1.38 ...
$ defi_day : num 0.956 0.535 0.327 0.509 0.355 ...
$ Nebel_dayAnteil : num 0 0 0 0 0 0 0 0 0 0 ...
$ Sonnenscheind_dayAnzahl: int 18 18 18 18 18 18 18 18 18 18 ...
$ Sonnenscheind_daySum : num 6.63 4.93 1.05 5.82 3.27 ...
$ julian_day : int 244 245 246 247 248 249 250 251 252 253 ...
$ zeta_day : num 2.81 2.82 2.84 2.86 2.88 ...
$ maxSonnenscheind : num 13.9 13.8 13.7 13.6 13.5 ...
$ R0_day : num 2920 2890 2860 2830 2799 ...
$ Globalstrahlung_dayMean: num NA NA NA NA NA NA NA NA NA NA ...
$ RG_day : num 13.24 11.19 6.64 12.02 9.03 ...
$ lambdaET_day : num 2.45 2.46 2.46 2.46 2.47 ...
$ sAnstieg_day : num 0.15 0.122 0.116 0.113 0.111 ...
$ gamma_day : num 0.067 0.0669 0.0659 0.0661 0.0668 ...
$ ETp_TW_day : num 2.71 2.15 1.28 2.24 1.68 ...
$ Moor_ID : chr "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
$ Distanz_in_km : num 24 24 24 24 24 ...
$ North : num 53.5 53.5 53.5 53.5 53.5 ...
$ East : num 8.58 8.58 8.58 8.58 8.58 ...
$ Hoehe_in_m : num 7 7 7 7 7 7 7 7 7 7 ...
$ Kueste_km : num 20 20 20 20 20 ...
$ peatland : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ diffmaxt2m : num -1.6 0 -0.2 0.1 -0.4 ...
- attr(*, "sorted")= chr "Moor_ID"
- attr(*, ".internal.selfref")=<externalptr>
Second data.table:
str(meta)
Classes ‘data.table’ and 'data.frame': 10212 obs. of 6 variables:
$ Moor_ID : chr "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" "Ahlenmoor" ...
$ Hoehe_Moor : num 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 2.35 ...
$ Kueste_km : num 15.7 15.7 15.7 15.7 15.7 ...
$ WSPsommer_muGOK: num 0.699 0.699 0.699 0.699 0.699 ...
$ WSPwinter_muGOK: num 0.446 0.446 0.446 0.446 0.446 ...
$ Moorgroesse_km2: num 59 59 59 59 59 59 59 59 59 59 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "Moor_ID"
Session info:
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] grDevices datasets splines graphics stats tcltk utils methods base
other attached packages:
[1] reshape_0.8.4 plyr_1.7.1 data.table_1.8.0 svSocket_0.9-53 TinnR_1.0-5 R2HTML_2.2 Hmisc_3.9-3
[8] survival_2.36-14
loaded via a namespace (and not attached):
[1] cluster_1.14.2 grid_2.15.1 lattice_0.20-6 svMisc_0.9-65 tools_2.15.1
My bad. The problem was that keys were not unique:
a<-data.table(x=c(1,1),y=c(1,2))
b<-data.table(x=c(1,1),y=c(3,4))
setkey(a,x)
setkey(b,x)
a[b]
x y y.1
[1,] 1 1 3
[2,] 1 2 3
[3,] 1 1 4
[4,] 1 2 4
It would be nice if data.table could give a warning for that.
Update from Matthew
This warning has now been implemented in v1.8.7 :
New argument
allow.cartesian
( defaultFALSE
) added toX[Y]
andmerge(X,Y)
, #2464. Prevents large allocations due to misspecified joins; e.g., duplicate key values inY
joining to the same group inX
over and over again. The word cartesian is used loosely for when more thanmax(nrow(X),nrow(Y))
rows would be returned. The error message is verbose and includes advice.
Maybe others functions can work better, like merge()
or cbind()
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With