Efficiently merging large data.tables [duplicate]

I have two fairly large data.table objects that I want to merge.

  • dt1 has 500,000,000 observations and 5 columns.
  • dt2 has 300,000 observations and 2 columns.

Both objects have the same key called id.

I want to left_join information from dt2 into dt1.

For example:

dt1 <- data.table(id = c(1, 2, 3, 4),
                  x1 = c(12, 13, 14, 15),
                  x2 = c(5, 6, 7, 8),
                  x3 = c(33, 44, 55, 66),
                  x4 = c(123, 123, 123, 123))

dt2 <- data.table(id = c(1, 2, 3, 4),
                  x5 = c(555, 666, 777, 888))
setkey(dt1, id)
setkey(dt2, id)

dt2[dt1, on="id"] 

> dt2[dt1, on="id"]
   id  x5 x1 x2 x3  x4
1:  1 555 12  5 33 123
2:  2 666 13  6 44 123
3:  3 777 14  7 55 123
4:  4 888 15  8 66 123

However, when merging my original data, R can no longer allocate memory, even though the output of the merge would fit in RAM.

What is the most efficient (speed vs. memory limitations) way of getting this large merge done?

Should we split-apply-combine?

Should we use a DB library to get this done?

How would you do this efficiently?

asked May 07 '19 by wake_wake


2 Answers

Keyed assignment should save memory.

dt1[dt2, on = "id", x5 := x5]  # adds the x5 column to dt1 by reference, without copying dt1

Should we use a DB library to get this done?

That's probably a good idea. If setting up and using a database is painful for you, try the RSQLite package. It's pretty simple.
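
For instance, here is a minimal sketch of pushing the join into an on-disk SQLite database with DBI and RSQLite (the file name merge.sqlite and the index name are just examples; the tables mirror the ones above):

library(DBI)
library(RSQLite)

# write both tables into an on-disk SQLite database
con <- dbConnect(RSQLite::SQLite(), "merge.sqlite")
dbWriteTable(con, "dt1", dt1)
dbWriteTable(con, "dt2", dt2)

# an index on the join column speeds up the lookup side of the join
dbExecute(con, "CREATE INDEX idx_dt2_id ON dt2 (id)")

# let SQLite do the left join on disk and pull only the result back into R
res <- dbGetQuery(con, "
  SELECT dt1.*, dt2.x5
  FROM dt1
  LEFT JOIN dt2 ON dt1.id = dt2.id
")

dbDisconnect(con)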


My experiment

tl;dr: 55% less memory used by keyed assignment compared to merge-and-replace, for a toy example.

I wrote two scripts that each sourced a setup script, dt-setup.R to create dt1 and dt2. The first script, dt-merge.R, updated dt1 through the "merge" method. The second, dt-keyed-assign.R, used keyed assignment. Both scripts recorded memory allocations using the Rprofmem() function.

To avoid torturing my laptop, I made dt1 500,000 rows and dt2 3,000 rows.

Scripts:

# dt-setup.R
library(data.table)

set.seed(9474)
id_space <- seq_len(3000)
dt1  <- data.table(
  id = sample(id_space, 500000, replace = TRUE),
  x1 = runif(500000),
  x2 = runif(500000),
  x3 = runif(500000),
  x4 = runif(500000)
)
dt2 <- data.table(
  id = id_space,
  x5 = 11 * id_space
)
setkey(dt1, id)
setkey(dt2, id)

# dt-merge.R
source("dt-setup.R")
Rprofmem(filename = "dt-merge.out")
dt1 <- dt2[dt1, on = "id"]
Rprofmem(NULL)

# dt-keyed-assign.R
source("dt-setup.R")
Rprofmem(filename = "dt-keyed-assign.out")
dt1[dt2, on = "id", x5 := x5]
Rprofmem(NULL)

With all three scripts in my working directory, I ran each of the joining scripts in a separate R process.

system2("Rscript", "dt-merge.R")
system2("Rscript", "dt-keyed-assign.R")

I think the lines in the output files generally follow the pattern "<bytes> :<call stack>". I haven't found good documentation for this. However, the numbers at the front were never below 128, which is the default minimum size in bytes below which R does not malloc for vectors.

Note that not all of these allocations add to the total memory used by R. R might reuse some memory it already has after a garbage collection. So it's not a good way to measure how much memory is used at any specific time. However, if we assume garbage collection behavior is independent, it does work as a comparison between scripts.

Some sample lines of the memory report:

cat(readLines("dt-merge.out", 5), sep = "\n")
# 90208 :"get" "[" 
# 528448 :"get" "[" 
# 528448 :"get" "[" 
# 1072 :"get" "[" 
# 20608 :"get" "["

There are also lines like new page:"get" "[" for page allocations.

Luckily, these are simple to parse.

parse_memory_report <- function(path) {
  report <- readLines(path)
  new_pages <- startsWith(report, "new page:")
  allocations <- as.numeric(gsub(":.*", "", report[!new_pages]))
  total_malloced <- sum(allocations)
  message(
    "Summary of ", path, ":\n",
    sum(new_pages), " new pages allocated\n",
    total_malloced, " bytes malloced"
  )
}

parse_memory_report("dt-merge.out")
# Summary of dt-merge.out:
# 12 new pages allocated
# 32098912 bytes malloced

parse_memory_report("dt-keyed-assign.out")
# Summary of dt-keyed-assign.out:
# 13 new pages allocated
# 14284272 bytes malloced

I got exactly the same results when repeating the experiment.

So keyed assignment has one more page allocation. The default byte size for a page is 2000. I'm not sure how malloc works, and 2000 is tiny relative to all the allocations, so I'll ignore this difference. Please chastise me if this is dumb.

So, ignoring pages, keyed assignment allocated 55% less memory than the merge.

answered Oct 12 '22 by Nathan Werth


If you must go with the split-merge approach and the following operation fits in your memory, be sure to preallocate as much as possible to make the iterations faster. Something like this was the most efficient solution I could come up with when dealing with a similar problem:

library(data.table)

dt1 <- data.table(id = c(1, 2, 3, 4),
                  x1 = c(12, 13, 14, 15),
                  x2 = c(5, 6, 7, 8),
                  x3 = c(33, 44, 55, 66),
                  x4 = c(123, 123, 123, 123))

dt2 <- data.table(id = c(1, 2, 3, 4),
                  x5 = c(555, 666, 777, 888))

dt1_id <- sort(unique(dt1$id)) # extract all ids that are in dt1
dt1_l_split <- length(dt1_id)  # number of iterations

split_dt1 <- vector(mode = "list", length = dt1_l_split)               # preallocate the list
split_dt1 <- lapply(1:dt1_l_split, function(x) dt1[id %in% dt1_id[x]]) # fill the list with splits

rm(dt1); gc() # remove the large data.table to save memory and clean up RAM

dt1 <- lapply(1:dt1_l_split, function(i) {
  print(Sys.time())
  print(i)

  tmp <- dt2[id %in% dt1_id[i]]          # load the relevant part of dt2
  merge(tmp, split_dt1[[i]], all = TRUE) # merge the pieces of dt1 and dt2
})
dt1 <- rbindlist(dt1) # stack the merged pieces back into one data.table

You could try to use mclapply from the parallel package to speed up your computations. I've had mixed results, though; sometimes it really speeds things up, sometimes it is slower, so it's best to try it out yourself.
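
As a rough sketch (mclapply relies on forking, so on Windows it only works with mc.cores = 1), the lapply call above could be swapped out like this; mc.cores = 4 is an arbitrary choice and should be tuned to your machine:

library(parallel)

# same loop as above, but fanned out over several cores
dt1 <- mclapply(1:dt1_l_split, function(i) {
  tmp <- dt2[id %in% dt1_id[i]]
  merge(tmp, split_dt1[[i]], all = TRUE)
}, mc.cores = 4)
dt1 <- rbindlist(dt1)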

Alternatively (and in my opinion the easiest solution), just push the project into your Dropbox/Google Drive/whatever cloud you prefer and set up a Google Cloud VM with 52 GB of RAM, a few CPUs, and Windows Server (yikes, but then there's no need to set up a GUI etc. yourself). It took me about 10 minutes to set everything up, and you get $300 of credit for the first year, which makes it basically free.

answered Oct 13 '22 by YalDan