In a large data set (~1M cases), each case has a "created" and a "censored" dateTime. I want to count the number of other cases that were open at the time each case was created. Cases are open between their "created" and "censored" dateTimes.
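To make the counting rule concrete, here is a minimal illustration (a small sketch with three invented cases, separate from the benchmark code below):
# Three hypothetical cases: count, for each case, how many other cases
# were created before it and censored after it (i.e. still open at its creation)
toy <- data.frame(id = 1:3,
                  created  = as.POSIXct(c("2001-01-01", "2001-06-01", "2002-01-01"), tz="UTC"),
                  censored = as.POSIXct(c("2001-09-01", "2003-01-01", "2002-06-01"), tz="UTC"));
sapply(seq_len(nrow(toy)), function(i)
  sum(toy$created < toy$created[i] & toy$censored > toy$created[i]));
# [1] 0 1 1  -> case 1 had no open cases; cases 2 and 3 each had one other case open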
Several solutions work well on small datasets (<100,000 cases), but computation time grows roughly quadratically; my estimate is that it scales as about 3n^2. By n = 100,000 cases, computation time is >20 minutes on my server (6 × 4 GHz cores, 64 GB RAM). Even with multi-core libraries, I could at best reduce the time by a factor of 8 or 10, which is not enough to handle ~1M cases.
I'm looking for a more efficient method to do this calculation. Below I have provided a function that lets you easily create large numbers of "created" and "censored" dateTime pairs, along with the solutions tried so far, using the dplyr and data.table libraries. The timings are reported to the user for simplicity. You can simply change the CASE_COUNT variable at the top to re-execute, view the times again, and easily compare the timing of any other solutions you might suggest.
I will update the original post with other solutions to give proper credit to their authors. Thanks in advance for your help with this!
# Load libraries used in this example
library(dplyr);
library(data.table);
# Not on CRAN. See: http://bioconductor.org/packages/release/bioc/html/IRanges.html
library(IRanges);
# Set seed for reproducibility
set.seed(123)
# Set number of cases & date range variables
CASE_COUNT <<- 1000;
RANGE_START <- as.POSIXct("2000-01-01 00:00:00",
format="%Y-%m-%d %H:%M:%S",
tz="UTC", origin="1970-01-01");
RANGE_END <- as.POSIXct("2012-01-01 00:00:00",
format="%Y-%m-%d %H:%M:%S",
tz="UTC", origin="1970-01-01");
# Select which solutions you want to run in this test
RUN_SOLUTION_1 <- TRUE; # dplyr::summarize() + comparisons
RUN_SOLUTION_2 <- TRUE; # data.table::foverlaps()
RUN_SOLUTION_3 <- TRUE; # data.table aggregation + comparisons
RUN_SOLUTION_4 <- TRUE; # IRanges::IRanges + countOverlaps()
RUN_SOLUTION_5 <- TRUE; # data.table::frank()
# Function to generate random creation & censor dateTime pairs
# The censor time always has to be after the creation time
# Credit to @DirkEddelbuettel for this smart function
# (https://stackoverflow.com/users/143305/dirk-eddelbuettel)
generate_cases_table <- function(n = CASE_COUNT, start_val=RANGE_START, end_val=RANGE_END) {
# Measure duration between start_val & end_val
duration <- as.numeric(difftime(end_val, start_val, unit="secs"));
# Select random values in duration to create start_offset
start_offset <- runif(n, 0, duration);
# Calculate the creation time list
created_list <- start_offset + start_val;
# Calculate acceptable time range for censored values
# since they must always be after their respective creation value
censored_range <- as.numeric(difftime(end_val, created_list, unit="secs"));
# Select random values in duration to create end_offset
creation_to_censored_times <- runif(n, 0, censored_range);
censored_list <- created_list + creation_to_censored_times;
# Create and return a data.table with creation & censor values
# calculated from start or end with random offsets
return_table <- data.table(id = 1:n,
created = created_list,
censored = censored_list);
return(return_table);
}
# Create the data table with the desired number of cases specified by CASE_COUNT above
cases_table <- generate_cases_table();
solution_1_function <- function (cases_table) {
# SOLUTION 1: Using dplyr::summarize:
# Group by id to set parameters for summarize() function
cases_table_grouped <- group_by(cases_table, id);
# Count the instances where other cases were created before
# and censored after each case using vectorized sum() within summarize()
cases_table_summary <- summarize(cases_table_grouped,
open_cases_at_creation = sum((cases_table$created < created &
cases_table$censored > created)));
solution_1_table <<- as.data.table(cases_table_summary, key="id");
} # End solution_1_function
solution_2_function <- function (cases_table) {
# SOLUTION 2: Using data.table::foverlaps:
# Adapted from solution provided by @Davidarenburg
# (https://stackoverflow.com/users/3001626/david-arenburg)
# The foverlaps() solution tends to crash R with large case counts
# I suspect it has to do with memory assignment of the very large objects
# It maxes RAM on my system (64GB) before crashing, possibly attempting
# to write beyond its assigned memory limits.
# I'll submit a reproducible bug to the data.table team since
# foverlaps() is pretty new and known to be occasionally unstable
if (CASE_COUNT > 50000) {
stop("The foverlaps() solution tends to crash R with large case counts. Not running.");
}
setDT(cases_table)[, created_dupe := created];
setkey(cases_table, created, censored);
foverlaps_table <- foverlaps(cases_table[,c("id","created","created_dupe"), with=FALSE],
cases_table[,c("id","created","censored"), with=FALSE],
by.x=c("created","created_dupe"))[order(i.id),.N-1,by=i.id];
foverlaps_table <- dplyr::rename(foverlaps_table, id=i.id, open_cases_at_creation=V1);
solution_2_table <<- as.data.table(foverlaps_table, key="id");
} # End solution_2_function
solution_3_function <- function (cases_table) {
# SOLUTION 3: Using data.table aggregation instead of dplyr::summarize
# Idea suggested by @jangorecki
# (https://stackoverflow.com/users/2490497/jangorecki)
# Count the instances where other cases were created before
# and censored after each case using vectorized sum() with data.table aggregation
cases_table_aggregated <- cases_table[order(id), sum((cases_table$created < created &
cases_table$censored > created)),by=id];
solution_3_table <<- as.data.table(dplyr::rename(cases_table_aggregated, open_cases_at_creation=V1), key="id");
} # End solution_3_function
solution_4_function <- function (cases_table) {
# SOLUTION 4: Using IRanges package
# Adapted from solution suggested by @alexis_laz
# (https://stackoverflow.com/users/2414948/alexis-laz)
# The IRanges package generates ranges efficiently, intended for genome sequencing
# but working perfectly well on this data, since POSIXct values are numeric-representable
solution_4_table <<- data.table(id = cases_table$id,
open_cases_at_creation = countOverlaps(IRanges(cases_table$created,
cases_table$created),
IRanges(cases_table$created,
cases_table$censored))-1, key="id");
} # End solution_4_function
solution_5_function <- function (cases_table) {
# SOLUTION 5: Using data.table::frank()
# Adapted from solution suggested by @danas.zuokas
# (https://stackoverflow.com/users/1249481/danas-zuokas)
n <- CASE_COUNT;
# For every case compute the number of other cases
# with `created` less than `created` of other cases
r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]), ties.method = 'first')[1:n];
# For every case compute the number of other cases
# with `censored` less than `created`
r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]), ties.method = 'first')[1:n];
solution_5_table <<- data.table(id = cases_table$id,
open_cases_at_creation = r1 - r2, key="id");
} # End solution_5_function;
# Execute user specified functions;
if (RUN_SOLUTION_1)
solution_1_timing <- system.time(solution_1_function(cases_table));
if (RUN_SOLUTION_2) {
solution_2_timing <- try(system.time(solution_2_function(cases_table)));
# Only drop the helper column if solution 2 ran far enough to create it
if ("created_dupe" %in% names(cases_table))
cases_table <- select(cases_table, -created_dupe);
}
if (RUN_SOLUTION_3)
solution_3_timing <- system.time(solution_3_function(cases_table));
if (RUN_SOLUTION_4)
solution_4_timing <- system.time(solution_4_function(cases_table));
if (RUN_SOLUTION_5)
solution_5_timing <- system.time(solution_5_function(cases_table));
# Check generated tables for comparison
if (RUN_SOLUTION_1 && RUN_SOLUTION_2 && class(solution_2_timing)!="try-error") {
same_check1_2 <- all(solution_1_table$open_cases_at_creation == solution_2_table$open_cases_at_creation);
} else {same_check1_2 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_3) {
same_check1_3 <- all(solution_1_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check1_3 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_4) {
same_check1_4 <- all(solution_1_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check1_4 <- TRUE;}
if (RUN_SOLUTION_1 && RUN_SOLUTION_5) {
same_check1_5 <- all(solution_1_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check1_5 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_3 && class(solution_2_timing)!="try-error") {
same_check2_3 <- all(solution_2_table$open_cases_at_creation == solution_3_table$open_cases_at_creation);
} else {same_check2_3 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_4 && class(solution_2_timing)!="try-error") {
same_check2_4 <- all(solution_2_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check2_4 <- TRUE;}
if (RUN_SOLUTION_2 && RUN_SOLUTION_5 && class(solution_2_timing)!="try-error") {
same_check2_5 <- all(solution_2_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check2_5 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_4) {
same_check3_4 <- all(solution_3_table$open_cases_at_creation == solution_4_table$open_cases_at_creation);
} else {same_check3_4 <- TRUE;}
if (RUN_SOLUTION_3 && RUN_SOLUTION_5) {
same_check3_5 <- all(solution_3_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check3_5 <- TRUE;}
if (RUN_SOLUTION_4 && RUN_SOLUTION_5) {
same_check4_5 <- all(solution_4_table$open_cases_at_creation == solution_5_table$open_cases_at_creation);
} else {same_check4_5 <- TRUE;}
same_check <- all(same_check1_2, same_check1_3, same_check1_4, same_check1_5,
same_check2_3, same_check2_4, same_check2_5, same_check3_4,
same_check3_5, same_check4_5);
# Report summary of results to user
cat("This execution was for", CASE_COUNT, "cases.\n",
"It is", same_check, "that all solutions match.\n");
if (RUN_SOLUTION_1)
cat("The dplyr::summarize() solution took", solution_1_timing[3], "seconds.\n");
if (RUN_SOLUTION_2 && class(solution_2_timing)!="try-error")
cat("The data.table::foverlaps() solution took", solution_2_timing[3], "seconds.\n");
if (RUN_SOLUTION_3)
cat("The data.table aggregation solution took", solution_3_timing[3], "seconds.\n");
if (RUN_SOLUTION_4)
cat("The IRanges solution solution took", solution_4_timing[3], "seconds.\n");
if (RUN_SOLUTION_5)
cat("The data.table:frank() solution solution took", solution_5_timing[3], "seconds.\n\n");
The data.table::foverlaps() solution is faster for fewer cases (<5,000 or so; the exact crossover depends on the randomness in the data as well as n, since foverlaps() uses binary search to optimize). The dplyr::summarize() solution is faster for more cases (>5,000 or so). Much beyond 100,000 cases, neither solution is viable, as both are too slow.
EDIT: Added a third solution based on the idea suggested by @jangorecki that uses data.table aggregation instead of dplyr::summarize(), and is otherwise similar to the dplyr solution. For up to around 50,000 cases, it is the fastest solution. Beyond 50,000 cases, the dplyr::summarize() solution is slightly faster, but not by much. Sadly, for 1M cases it is still not practical.
EDIT2: Added a fourth solution adapted from the solution suggested by @alexis_laz that uses the IRanges package and its countOverlaps() function. It is significantly faster than the three other solutions: with 50,000 cases it was almost 400% faster than solutions 1 and 3.
EDIT3: Modified the case-generating function to properly exercise the "censored" condition. Thanks to @jangorecki for catching the limitation of the previous version.
EDIT4: Rewrote to allow the user to select which solutions to execute and to use system.time() for the timing comparison, with garbage collection before each execution for more accurate timing (as per @jangorecki's astute observation). Also added some condition checks for crash cases.
EDIT5: Added a fifth solution, adapted from the solution suggested by @danas.zuokas, using rank(). My experimentation suggests that it is always at least an order of magnitude slower than the other solutions: at 10,000 cases, it takes 44 seconds vs. 3.5 seconds for the dplyr::summarize() solution and 0.36 seconds for the IRanges solution.
FINAL EDIT: I've made slight modifications to solution 5 as suggested by @danas.zuokas, matching the observation by @Khashaa about types. Setting the type with as.numeric in the dateTime generation function drastically speeds up rank(), because it then operates on integers or doubles instead of dateTime objects (it increases the speed of the other functions too, but not as drastically). With some testing, setting ties.method='first' yields results consistent with the intent. data.table::frank() is faster than both base::rank() and IRanges::rank(). bit64::rank() is the fastest, but it seems to handle ties differently than data.table::frank() and I can't get it to handle them as desired. Once bit64 is loaded, it masks a large number of types and functions, changing the results of data.table::frank() along the way. The specific reasons why are beyond the scope of this question.
POST END NOTE: It turns out that data.table::frank() handles POSIXct dateTimes efficiently, whereas neither base::rank() nor IRanges::rank() seems to. As such, even the as.numeric (or as.integer) type setting isn't necessary with data.table::frank(), there is no loss of precision from a conversion, and there are fewer ties.method discrepancies.
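As a quick illustration of that note (a minimal sketch, assuming cases_table from the generator above), frank() can be fed the POSIXct column directly and returns the same ranks as the numeric conversion:
# Minimal sketch (assumes cases_table created by generate_cases_table() above):
# ranks computed on the POSIXct column match ranks computed after as.numeric()
identical(data.table::frank(cases_table$created, ties.method="first"),
          data.table::frank(as.numeric(cases_table$created), ties.method="first"));
# Expected: TRUE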
Thank you to everyone who contributed! I learned a lot! Much appreciated! :)
Credit will be included in my source-code.
ENDNOTE: This question is a refined and clarified version, with easier-to-use and more readable example code, of More efficient method for counting open cases as of creation time of each case. I've separated it here so as not to overwhelm the original post with too many edits, and to simplify the creation of large numbers of dateTime pairs in the example code. That way, you don't have to work as hard to answer. Thanks again!
The answer is updated with regard to a comment by the author of the question.
I would suggest a solution using ranks. Tables are created as in a follow-up to this question, or using the dateTime pairs generation function in the present question; both should work.
n <- cases_table[, .N]
# For every case compute the number of other cases
# with `created` less than `creation` of other cases
r1 <- data.table::frank(c(cases_table[, created], cases_table[, created]),
ties.method = 'first')[1:n]
# For every case compute the number of other cases
# with `censored` less than `created`
r2 <- data.table::frank(c(cases_table[, created], cases_table[, censored]),
ties.method = 'first')[1:n]
Taking the difference r1 - r2 (the -1 is not required with ties.method='first') gives the result, eliminating the ranks of created against itself. In terms of efficiency, it only requires finding the ranks of a vector whose length is the number of rows in cases_table. data.table::frank() handles POSIXct dateTime objects as quickly as numeric objects (unlike base::rank), so no type conversion is required.
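As a quick sanity check (a minimal sketch, not part of the original answer; it assumes cases_table from the question's generator, whose runif()-based values make ties essentially impossible), the rank-based counts can be compared against the naive O(n^2) definition:
# Hedged sanity check: compare r1 - r2 to the brute-force count
naive_counts <- sapply(seq_len(nrow(cases_table)), function(i)
  sum(cases_table$created < cases_table$created[i] &
      cases_table$censored > cases_table$created[i]));
all(naive_counts == r1 - r2);
# Expected: TRUE (given distinct created/censored values)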