I have two fairly large data.table objects that I want to merge. dt1 has 500,000,000 observations and 5 columns; dt2 has 300,000 observations and 2 columns. Both objects have the same key, called id.
I want to left_join information from dt2 into dt1.
For example:
library(data.table)

dt1 <- data.table(id = c(1, 2, 3, 4),
                  x1 = c(12, 13, 14, 15),
                  x2 = c(5, 6, 7, 8),
                  x3 = c(33, 44, 55, 66),
                  x4 = c(123, 123, 123, 123))
dt2 <- data.table(id = c(1, 2, 3, 4),
                  x5 = c(555, 666, 777, 888))
setkey(dt1, id)
setkey(dt2, id)
dt2[dt1, on="id"]
> dt2[dt1, on="id"]
id x5 x1 x2 x3 x4
1: 1 555 12 5 33 123
2: 2 666 13 6 44 123
3: 3 777 14 7 55 123
4: 4 888 15 8 66 123
However, when merging my original data, R can no longer allocate memory, even though the output of the merge would fit in RAM.
What is the most efficient way (balancing speed against memory limitations) to get this large merge done?
Should we split-apply-combine?
Should we use a DB library to get this done?
How would you do this efficiently?
Keyed assignment should save memory.
dt1[dt2, on = "id", x5 := x5]
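On the question's toy example, this updates dt1 in place, adding x5 by reference rather than building a new table; apart from column order, the content matches the merge output above. A quick sketch of what you would see:
dt1[dt2, on = "id", x5 := x5]
dt1
#    id x1 x2 x3  x4  x5
# 1:  1 12  5 33 123 555
# 2:  2 13  6 44 123 666
# 3:  3 14  7 55 123 777
# 4:  4 15  8 66 123 888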
Should we use a DB library to get this done?
That's probably a good idea. If setting up and using a database is painful for you, try the RSQLite package. It's pretty simple.
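For illustration, here is a minimal sketch of that route using the question's toy tables. It assumes the DBI and RSQLite packages are installed, and the file name merge.sqlite is just a placeholder; the join runs inside SQLite, so only the result has to come back into R's memory.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "merge.sqlite")  # on-disk database file (placeholder name)
dbWriteTable(con, "dt1", dt1)
dbWriteTable(con, "dt2", dt2)

# Left join in SQL; the result comes back as a data.frame
res <- dbGetQuery(con, "
  SELECT dt1.*, dt2.x5
  FROM dt1
  LEFT JOIN dt2 ON dt1.id = dt2.id
")
dbDisconnect(con)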
My experiment
tl;dr: 55% less memory used by keyed assignment compared to merge-and-replace, for a toy example.
I wrote two scripts that each sourced a setup script, dt-setup.R, to create dt1 and dt2. The first script, dt-merge.R, updated dt1 through the "merge" method. The second, dt-keyed-assign.R, used keyed assignment. Both scripts recorded memory allocations using the Rprofmem() function.
To avoid torturing my laptop, I kept dt1 to 500,000 rows and dt2 to 3,000 rows.
Scripts:
# dt-setup.R
library(data.table)
set.seed(9474)

id_space <- seq_len(3000)

dt1 <- data.table(
  id = sample(id_space, 500000, replace = TRUE),
  x1 = runif(500000),
  x2 = runif(500000),
  x3 = runif(500000),
  x4 = runif(500000)
)
dt2 <- data.table(
  id = id_space,
  x5 = 11 * id_space
)

setkey(dt1, id)
setkey(dt2, id)
# dt-merge.R
source("dt-setup.R")
Rprofmem(filename = "dt-merge.out")
dt1 <- dt2[dt1, on = "id"]
Rprofmem(NULL)
# dt-keyed-assign.R
source("dt-setup.R")
Rprofmem(filename = "dt-keyed-assign.out")
dt1[dt2, on = "id", x5 := x5]
Rprofmem(NULL)
With all three scripts in my working directory, I ran each of the joining scripts in a separate R process.
system2("Rscript", "dt-merge.R")
system2("Rscript", "dt-keyed-assign.R")
I think the lines in the output files generally follow the pattern "<bytes> :<call stack>". I haven't found good documentation for this. However, the numbers at the front were never below 128, and that is the default minimum size in bytes below which R does not malloc vectors individually.
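As a quick sanity check on that format, a tiny standalone run reproduces the same kind of lines (this assumes your R build has memory profiling enabled, which the CRAN binaries do):
Rprofmem("demo.out")
x <- runif(1e5)   # one large vector allocation
Rprofmem(NULL)
cat(readLines("demo.out"), sep = "\n")
# e.g. 800048 :"runif"   (possibly alongside a few smaller allocations)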
Note that not all of these allocations add to the total memory used by R. R might reuse memory it already has after a garbage collection. So this is not a good way to measure how much memory is used at any specific time. However, if we assume garbage collection behavior is independent of the joining method, the totals do work as a comparison between scripts.
Some sample lines of the memory report:
cat(readLines("dt-merge.out", 5), sep = "\n")
# 90208 :"get" "["
# 528448 :"get" "["
# 528448 :"get" "["
# 1072 :"get" "["
# 20608 :"get" "["
There are also lines like new page:"get" "[" for page allocations.
Luckily, these are simple to parse.
parse_memory_report <- function(path) {
  report <- readLines(path)
  new_pages <- startsWith(report, "new page:")
  allocations <- as.numeric(gsub(":.*", "", report[!new_pages]))
  total_malloced <- sum(allocations)
  message(
    "Summary of ", path, ":\n",
    sum(new_pages), " new pages allocated\n",
    total_malloced, " bytes malloced"
  )
}
parse_memory_report("dt-merge.out")
# Summary of dt-merge.out:
# 12 new pages allocated
# 32098912 bytes malloced
parse_memory_report("dt-keyed-assign.out")
# Summary of dt-keyed-assign.out:
# 13 new pages allocated
# 14284272 bytes malloced
I got exactly the same results when repeating the experiment.
So keyed assignment has one more page allocation. The default byte size for a page is 2000. I'm not sure how malloc works, and 2000 bytes is tiny relative to all the other allocations, so I'll ignore this difference. Please chastise me if this is dumb.
So, ignoring pages, keyed assignment allocated 55% less memory than the merge.
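That figure is just the ratio of the two totals reported above:
1 - 14284272 / 32098912
# ~0.555, i.e. roughly 55% fewer bytes malloced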
If you must go for the split-merge approach, and the operations below fit in your memory, be sure to preallocate as much as possible to make the iterations faster. Something like this was the most efficient solution I could come up with when dealing with a similar problem:
dt1 <- data.table(id = c(1, 2, 3, 4),
                  x1 = c(12, 13, 14, 15),
                  x2 = c(5, 6, 7, 8),
                  x3 = c(33, 44, 55, 66),
                  x4 = c(123, 123, 123, 123))
dt2 <- data.table(id = c(1, 2, 3, 4),
                  x5 = c(555, 666, 777, 888))

dt1_id <- sort(unique(dt1$id))  # extract all ids that are in dt1
dt1_l_split <- length(dt1_id)   # get number of iterations
dt2_l_split <- length(unique(dt2[id %in% dt1_id]$id))

split_dt1 <- vector(mode = "list", length = length(unique(dt1$id)))               # preallocate list
split_dt1 <- lapply(1:dt1_l_split, function(x) dt1[id %in% dt1_id[[x]]])          # fill list with splits

rm(dt1); gc()  # remove the large data.table to save memory and clean up RAM

dt1 <- lapply(1:dt1_l_split, function(i) {
  print(Sys.time())
  print(i)
  tmp <- dt2[id %in% dt1_id[[i]]]        # load the relevant part of dt2
  merge(tmp, split_dt1[[i]], all = TRUE) # merge the pieces of dt1 and dt2
})

dt1 <- rbindlist(dt1)  # combine the splits back into a single data.table
You could try to use mclapply from the parallel package to speed up your computations. I've had mixed results with it though: sometimes it really speeds things up, sometimes it's slower, so it's best to just try it out.
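If you want to try that, the change to the loop above is minimal. A sketch, assuming a Unix-alike system (mclapply() relies on forking and does not support mc.cores > 1 on Windows); mc.cores = 4 is just a placeholder to tune for your machine:
library(parallel)

dt1 <- mclapply(1:dt1_l_split, function(i) {
  tmp <- dt2[id %in% dt1_id[[i]]]        # load the relevant part of dt2
  merge(tmp, split_dt1[[i]], all = TRUE) # merge the pieces of dt1 and dt2
}, mc.cores = 4)                         # placeholder core count

dt1 <- rbindlist(dt1)  # combine the splits back into a single data.table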
Alternatively (and in my opinion the easiest solution), just push the project into your Dropbox/Google Drive/whatever cloud you prefer and set up a Google Cloud VM with 52 GB of RAM, a few CPUs, and Windows Server (yikes, but there's no need to set up a GUI etc. yourself). It took me about 10 minutes to set everything up, and you get a $300 credit for the first year, which makes it basically free.