 

R: faster merging of data frames

Tags:

r

dplyr

I have just started using R and wrote the following code, but it takes about 40 minutes to run, so I am sure it can be written in a way that runs faster.

Basically, I have one large data set (crsp_td_net) of about 7 GB in size and a second, smaller data set (ff_35f). Both contain trading dates. What I am trying to do is fill in the missing trading dates for each company in the first data set.

From my first data set, I create subsets based on a company index and merge each subset with the second data set on trading dates. Each merged subset is appended to the result for the previous companies, and so on, until I am left with one large data set containing all the original companies but with the missing trading days filled in.

I am not sure at this point whether the loop runs slowly because the result data frame (todo) keeps growing, or because the loop itself is coded inefficiently. I understand that vectorization can help speed this up, but I am not sure how to do that here (the size of each company's subset is different). I am also not sure how best to use apply, sapply or lapply (if any of them can be used here). I have browsed a few related questions but have not found a way to do this. I would very much appreciate an alternative snippet of code that makes the code below run faster.

todo <- matrix(numeric(0), 0, 4)

for (i in 1:7396) {
  final <- crsp_td_net %>%
    filter(compid == i) %>%
    merge(ff_35f, by = "date_crsp", all = TRUE)

  final <- final %>%
    filter(between(date_crsp,
                   as.Date(min(date_crsp_orig, na.rm = TRUE)),
                   as.Date(max(date_crsp_orig, na.rm = TRUE)))) %>%
    arrange(date_crsp) %>%
    mutate(cusip8dg_compustat = ifelse(is.na(cusip8dg_compustat),
                                       max(cusip8dg_compustat, na.rm = TRUE),
                                       cusip8dg_compustat)) %>%
    mutate(compid = ifelse(is.na(compid), i, compid)) %>%
    select(compid, cusip8dg_compustat, date_crsp, date_crsp_orig) %>%
    distinct()

  todo <- bind_rows(todo, final)
}
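
If the growing data frame turns out to be the bottleneck, one option is to build the per-company pieces with lapply() and bind them once at the end. This is a minimal, untested sketch of that idea, reusing the same column names and logic as the loop above:

library(dplyr)

# build the per-company pieces in a list and bind them once at the end,
# instead of growing `todo` inside the loop
todo <- bind_rows(lapply(1:7396, function(i) {
  crsp_td_net %>%
    filter(compid == i) %>%
    merge(ff_35f, by = "date_crsp", all = TRUE) %>%
    filter(between(date_crsp,
                   as.Date(min(date_crsp_orig, na.rm = TRUE)),
                   as.Date(max(date_crsp_orig, na.rm = TRUE)))) %>%
    arrange(date_crsp) %>%
    mutate(cusip8dg_compustat = ifelse(is.na(cusip8dg_compustat),
                                       max(cusip8dg_compustat, na.rm = TRUE),
                                       cusip8dg_compustat),
           compid = ifelse(is.na(compid), i, compid)) %>%
    select(compid, cusip8dg_compustat, date_crsp, date_crsp_orig) %>%
    distinct()
}))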

Thanks in advance,

Dev

Thank you all for your responses. I was unable to reply in the comment box due to the length limit, so I am adding to my original post. @P Lapointe: please find a reproducible data set below (I have used integer values instead of actual dates). @eipi10: I think you have understood what I am after, and thanks for the code, but I am not sure whether it is missing something, as it prompts for an input (I have all the relevant libraries loaded). @Alistaire: I will indeed run into memory problems as I add more calculations to the original data set, so I would be grateful for your suggestions on how to make the loop faster, or for an alternative to it, and for how they would be implemented in the example below.

Many thanks.

zz <- "compid  date_crsp 
1          1        2
2          1        3     
3          1        5   
4          2        3  
5          2        7 
6          2        9 
7            3        3
8            3        5
9          3        7
10         3        8"
crsp_td_net <- read.table(text=zz, header = TRUE)


xx <- "date_crsp 
1 1 
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10  10
11  11"

ff_35f <- read.table(text=xx, header = TRUE)

# I expect my final output to look like this:

yy<-"compid date_crsp
1   1   2
2   1   3
3   1   4
4   1   5
5   2   3
6   2   4
7   2   5
8   2   6
9   2   7
10  2   8
11  2   9
12  3       3
13  3       4
14  3       5
15  3       6
16  3       7
17  3       8"

output_wanted<-read.table(text=yy, header = TRUE)

df <- full_join(crsp_td_net,
                expand.grid(compid = unique(crsp_td_net$compid),
                            date_crsp = unique(ff_35f$date_crsp)))

todo <- array(numeric(), c(1, 4, 0))
todo <- matrix(numeric(0), 0, 0)

for (i in 1:3) {
  final <- filter(crsp_td_net, compid == i)
  final <- mutate(final, date_crsp_orig = date_crsp)
  final <- merge(final, ff_35f, by = "date_crsp", all = TRUE)
  final <- filter(final, between(date_crsp, min(date_crsp_orig, na.rm = TRUE),
                                 max(date_crsp_orig, na.rm = TRUE)))
  final <- arrange(final, date_crsp)
  final <- mutate(final, compid = ifelse(is.na(compid), i, compid))
  final <- select(final, compid, date_crsp)
  final <- distinct(final)
  todo <- bind_rows(todo, final)
}

I have modified the full_join example and it now runs, but it does not do what I want, i.e. merge each compid with the unique trading days so as to fill in the missing trading days in the first data set. I would very much appreciate any suggestions on this.

The loop I wrote above gives me exactly what I want, but it takes about 40 minutes to run because I have to loop over roughly 7,000 compid values to build the large data set todo. Is there a faster way to write this loop, or an alternative to it?
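
For reference, here is a loop-free sketch along the lines of the tidyr/dplyr variant benchmarked in the answer below; it assumes date_crsp is numeric (or a Date) so that seq(..., by = 1) can enumerate every candidate day in each company's range:

library(dplyr)
library(tidyr)

# loop-free sketch: enumerate every day in each company's own date range,
# then keep only days that are actual trading days in ff_35f
output <- crsp_td_net %>%
  group_by(compid) %>%
  summarize(date_crsp = list(seq(from = min(date_crsp), to = max(date_crsp), by = 1))) %>%
  unnest(date_crsp) %>%
  semi_join(ff_35f, by = "date_crsp") %>%
  arrange(compid, date_crsp)

# with the reproducible data above, `output` should match `output_wanted`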

Many thanks in advance

crsp_td_net$date_crsp_orig <- crsp_td_net$date_crsp

df <- full_join(crsp_td_net,
                expand.grid(compid = unique(crsp_td_net$compid),
                            date_crsp = unique(ff_35f$date_crsp)),
                by = "date_crsp")
df <- df %>% filter(between(date_crsp, min(date_crsp_orig, na.rm = TRUE),
                            max(date_crsp_orig, na.rm = TRUE)))
df <- df %>% filter(!is.na(compid.x)) %>% select(-compid.y) %>%
  distinct() %>% arrange(compid.x, date_crsp)


1 Answer

Giving this problem a second thought, I believe it can be solved at reasonable speed using data.table's non-equi join. (I'm posting this as a separate answer because the approach is quite different from foverlaps().)

library(data.table)

# coerce to data.table
setDT(crsp_td_net)
setDT(ff_35f)

# find start and end date for each company
comp_date_range <- crsp_td_net[, .(start = min(date_crsp), end = max(date_crsp)), 
                               by = compid]

# non-equi join: the result contains only rows which fulfill the condition in on = ...
# by = .EACHI executes .SD for each group, returning matching rows for each date
# nomatch = 0 (inner join) skips dates without matching company
temp <- comp_date_range[ff_35f, on = c("start<=date_crsp", "end>=date_crsp"), 
                        .SD, by = .EACHI, nomatch = 0, allow.cartesian = TRUE]

# reorder result for convenience and pick desired columns
result <- temp[order(compid, start), .(compid, date_crsp = start)]

The result is in line with the expected output:

result
#    compid date_crsp
# 1:      1         2
# 2:      1         3
# 3:      1         4
# 4:      1         5
# 5:      2         3
# 6:      2         4
# 7:      2         5
# 8:      2         6
# 9:      2         7
#10:      2         8
#11:      2         9
#12:      3         3
#13:      3         4
#14:      3         5
#15:      3         6
#16:      3         7
#17:      3         8

Note that there is an implicit assumption that the range of dates given in ff_35f covers the whole range of dates used in crsp_td_net. Otherwise, company trading days would drop out of the result.
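
A quick way to guard against that would be a range check before the join; this is a minimal illustrative sketch, not part of the solution above:

# sanity check: ff_35f must span the full range of trading dates in crsp_td_net,
# otherwise companies trading near the boundaries would lose rows
stopifnot(min(ff_35f$date_crsp) <= min(crsp_td_net$date_crsp),
          max(ff_35f$date_crsp) >= max(crsp_td_net$date_crsp))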

Benchmark results

At the time of writing, three different solutions had been posted. The OP measured the elapsed times of all three with his 7 GB data set and reported:

  • 1.12 seconds for the foverlaps() solution
  • 1.41 seconds for the non-equi join solution
  • 11.92 seconds for the tidyr/dplyr solution

in comments here and here.

I was quite surprised to find the foverlaps() solution faster than the non-equi join, so I ran some benchmarks with varying problem sizes using the microbenchmark package.

The problem size is given by the number of companies. For each company, the trading days are randomly sampled from a set of 260 "dates" simulating one year without weekends (for details, see the code below). The resulting data set contains about 130 rows per company on average.

As can be seen from the chart of my own benchmarks (note that both axes are on a log scale):

[Chart: microbenchmark results, median execution time vs. number of rows for the three methods]

foverlaps() is somewhat faster than the non-equi join for larger problem sizes, while the non-equi join is the fastest method for smaller problem sizes; tidyr/dplyr is almost always the slowest method and an order of magnitude slower on large problems.

Define function for benchmark runs of problem size n_comp

bm_run <- function(n_comp) {
  # define 1 year of trading dates, simulating weekends
  ff_35f <- sort(outer(1:5, 7*(0:51), `+`))
  # create trading dates for each company
  crsp_td_net <- rbindlist(lapply(seq_len(n_comp), function(i) {
    # how many trading dates to sample for the current company?
    n_days <- sample(length(ff_35f), 1)
    # sample trading dates
    data.frame(compid = i,
               date_crsp = sort(sample(ff_35f, n_days)))
  }))
  # coerce to data.frame
  setDF(crsp_td_net)
  # turn vector of trading dates into data.frame
  ff_35f <- data.frame(date_crsp = ff_35f)
  # scale down number of repetitions with problem size
  n_times <- as.integer(scales::squish(1000*1000 / nrow(crsp_td_net), c(3, 1000)))
  print(sprintf("%i companies with a total of %i trading dates, %i runs", 
                n_comp, nrow(crsp_td_net), n_times))
  # do the benchmark runs for this problem size
  mb <- microbenchmark::microbenchmark(
    foverlaps = {
      foverlaps(
        setkey(setDT(crsp_td_net)[, .(start = min(date_crsp), end = max(date_crsp)), 
                                  by = compid], start, end),
        setkey(setDT(ff_35f)[, .(date_crsp, end = date_crsp)], date_crsp, end)
      )[order(compid, start), .(compid, date_crsp)]
    },
    non_equi_join = {
      setDT(crsp_td_net)[, .(start = min(date_crsp), end = max(date_crsp)), by = compid
                         ][setDT(ff_35f), on = c("start<=date_crsp", "end>=date_crsp"), 
                           .SD, by = .EACHI, nomatch = 0, allow.cartesian = TRUE
                           ][order(compid, start), .(compid, date_crsp = start)]
    },
    dplyr = {
      setDF(crsp_td_net)
      setDF(ff_35f)
      crsp_td_net %>%
        dplyr::group_by(compid) %>%
        dplyr::summarize(date_crsp = list(seq(from=min(date_crsp), to=max(date_crsp), by=1))) %>%
        tidyr::unnest() %>%
        dplyr::semi_join(ff_35f, by="date_crsp") %>%
        dplyr::arrange(compid, date_crsp)
    },
    times = n_times
  )
  # return problem size and timings as list
  return(list(n_comp, nrow(crsp_td_net), mb))
}

Run benchmark for different problem sizes

library(data.table)
library(magrittr)
# number of companies
n_comp <- outer(c(1,2), 10^(1:4), `*`)
# set seed of RNG for creation of reproducible data
set.seed(1234)
# do benchmark runs with different problem size derived from number of companies
bm <- lapply(n_comp, bm_run)

Prepare data for plotting

# create data.table with benchmark timings from chunks in the returned list
mbl <- rbindlist(lapply(bm, `[[`, i = 3), id = "n_row")
# aggregate results
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
# reorder factor levels 
mba[, expr := forcats::fct_reorder(expr, -median_time)]
# replace chunk number by number of rows
mba[, n_row := unlist(lapply(bm, `[[`, i = 2))[n_row]]

Create chart

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
  geom_point() + geom_smooth(se = FALSE) + 
  scale_x_log10(breaks = unique(mba$n_row), labels = scales::comma) + 
  scale_y_log10() +
  xlab("number of rows") + ylab("median of execution time [ms]") +
  ggtitle("microbenchmark results") + theme_bw()
Answered by Uwe