What are helpful optimizations in R for big data sets?

Tags: r, data.table, dplyr, tidyverse

I built a script that works great with small data sets (<1 M rows) but performs very poorly with large ones. I've heard that data.table is more performant than tibbles. I'd like to learn about data.table, and also about any other speed optimizations.

I'll share a couple of commands from the script as examples. In each example, the datasets have 10 to 15 million rows and 10 to 15 columns.

Getting the lowest date for a dataframe grouped by nine variables

      dataframe %>% 
      group_by(key_a, key_b, key_c,
               key_d, key_e, key_f,
               key_g, key_h, key_i) %>%
      summarize(min_date = min(date)) %>% 
      ungroup()

Doing a left join on two dataframes to add an additional column

      merge(dataframe, 
            dataframe_two, 
            by = c("key_a", "key_b", "key_c",
                   "key_d", "key_e", "key_f",
                   "key_g", "key_h", "key_i"),
            all.x = TRUE) %>% 
      as_tibble()

Joining two dataframes on the closest date

      dataframe %>%
      left_join(dataframe_two, 
                  by = "key_a") %>%
      group_by(key_a, date.x) %>%
      summarise(key_z = key_z[which.min(abs(date.x - date.y))]) %>%
      arrange(date.x) %>%
      rename(day = date.x)

What best practices can I apply and, in particular, what can I do to make these types of functions optimized for large datasets?

This is an example dataset

set.seed(1010)
library("dplyr")      # %>% and mutate()
library("lubridate")  # ymd() and days()
library("conflicted")
conflict_prefer("days", "lubridate")
bigint <- rep(
  sample(1238794320934:19082323109, 1*10^7)
)

key_a <-
  rep(c("green", "blue", "orange"), 1*10^7/2)

key_b <-
  rep(c("yellow", "purple", "red"), 1*10^7/2)

key_c <-
  rep(c("hazel", "pink", "lilac"), 1*10^7/2)

key_d <-
  rep(c("A", "B", "C"), 1*10^7/2)

key_e <-
  rep(c("D", "E", "F", "G", "H", "I"), 1*10^7/5)

key_f <-
  rep(c("Z", "M", "Q", "T", "X", "B"), 1*10^7/5)

key_g <-
  rep(c("Z", "M", "Q", "T", "X", "B"), 1*10^7/5)

key_h <-
  rep(c("tree", "plant", "animal", "forest"), 1*10^7/3)

key_i <-
  rep(c("up", "up", "left", "left", "right", "right"), 1*10^7/5)

sequence <- 
  seq(ymd("2010-01-01"), ymd("2020-01-01"), by = "1 day")

date_sequence <-
  rep(sequence, 1*10^7/(length(sequence) - 1))

dataframe <-
  data.frame(
    bigint,
    date = date_sequence[1:(1*10^7)],
    key_a = key_a[1:(1*10^7)],
    key_b = key_b[1:(1*10^7)],
    key_c = key_c[1:(1*10^7)],
    key_d = key_d[1:(1*10^7)],
    key_e = key_e[1:(1*10^7)],
    key_f = key_f[1:(1*10^7)],
    key_g = key_g[1:(1*10^7)],
    key_h = key_h[1:(1*10^7)],
    key_i = key_i[1:(1*10^7)]
  )

dataframe_two <-
  dataframe %>%
      mutate(date_sequence = ymd(date_sequence) + days(1))

sequence_sixdays <-
  seq(ymd("2010-01-01"), ymd("2020-01-01"), by = "6 days")

date_sequence <-
  rep(sequence_sixdays, 3*10^6/(length(sequence_sixdays) - 1))

key_z <-
  sample(1:10000000, 3*10^6)

dataframe_three <-
  data.frame(
    key_a = sample(key_a, 3*10^6),
    date = date_sequence[1:(3*10^6)],
    key_z = key_z[1:(3*10^6)]
  )

asked Sep 07 '20 09:09

Cauder

1 Answer

What best practices can I apply and, in particular, what can I do to make these types of functions optimized for large datasets?

Use the data.table package:

library(data.table)
d1 = as.data.table(dataframe)
d2 = as.data.table(dataframe_two)

1

Grouping by many columns is something data.table excels at.
See the bar chart at the very bottom of the second plot for a comparison against dplyr, Spark, and others for exactly this kind of grouping:
https://h2oai.github.io/db-benchmark

by_cols = paste("key", c("a","b","c","d","e","f","g","h","i"), sep="_")
a1 = d1[, .(min_date = min(date_sequence)), by=by_cols]

Note that I changed date to date_sequence; I think you meant that as the column name.
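As a minimal, self-contained sketch of the same grouped minimum, here is the pattern on a tiny toy table. The table `toy` and its values are made up for illustration and only two of the nine key columns are used; the shape of the call is the same with more keys.

```r
library(data.table)

# Hypothetical toy data, not the asker's dataset
toy <- data.table(
  key_a = c("green", "green", "blue", "blue", "blue"),
  key_b = c("A", "A", "B", "B", "C"),
  date  = as.Date(c("2020-03-01", "2020-01-15", "2020-02-10",
                    "2020-02-01", "2020-05-05"))
)

# Grouping columns passed as a character vector, as in the answer
by_cols <- c("key_a", "key_b")

# One row per key combination, holding the earliest date in that group
res <- toy[, .(min_date = min(date)), by = by_cols]
print(res)
```

Groups come back in order of first appearance, and no `ungroup()` step is needed because the result is a plain data.table.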

2

It is unclear which fields you want to merge the tables on; dataframe_two does not have the specified fields, so the query is invalid.
Please clarify.
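If the intent is a plain left join on shared key columns, the data.table form could look roughly like the sketch below. It assumes both tables actually carry the join columns. The tables `left` and `right` and the columns `value` and `extra` are hypothetical stand-ins, not the asker's data.

```r
library(data.table)

# Hypothetical toy tables that share two key columns
left  <- data.table(key_a = c("green", "blue", "orange"),
                    key_b = c("A", "B", "C"),
                    value = 1:3)
right <- data.table(key_a = c("green", "blue"),
                    key_b = c("A", "B"),
                    extra = c("x", "y"))

join_cols <- c("key_a", "key_b")

# right[left, on = ...] keeps every row of `left` (a left join);
# rows without a match get NA in the columns that come from `right`
joined <- right[left, on = join_cols]

# Alternative: add the column to a copy of `left` by reference,
# which avoids materialising a second full table
left2 <- copy(left)
left2[right, on = join_cols, extra := i.extra]
```

The update-by-reference form tends to matter on large tables, since it only allocates the new column rather than a whole new result.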

3

data.table has a very useful type of join called a rolling join, which does exactly what you need:

a3 = d2[d1, on=c("key_a","date_sequence"), roll="nearest"]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
#  Join results in more than 2^31 rows (internal vecseq reached physical limit).
#  Very likely misspecified join. Check for duplicate key values in i each of which
#  join to the same group in x over and over again. If that's ok, try by=.EACHI to
#  run j for each group to avoid the large allocation. Otherwise, please search for
#  this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker
#  for advice.

This results in an error, but the error is in fact very useful. On your real data the join may work perfectly fine, as the reason behind the error (the cardinality of matching rows) may be related to how the sample data was generated. It is very tricky to produce good dummy data for joins. If you get the same error on your real data, you may want to review the design of that query, as it attempts a many-to-many join that causes a row explosion, even after restricting each match to a single date_sequence value (taking roll into account). Strictly speaking, I don't see this kind of query as valid for that data, given the cardinalities of the join fields. You may want to introduce a data quality check layer in your workflow to ensure there are no duplicates on key_a and date_sequence combined.
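To make the duplicate check and the rolling join concrete, here is a small sketch on toy data where the lookup table is unique on the join columns. The tables `events` and `lookup`, the `lookup_date` helper column, and all values are assumptions for illustration only.

```r
library(data.table)

# Hypothetical toy tables: `events` needs the key_z of the nearest lookup date
events <- data.table(
  key_a = c("green", "green", "blue"),
  date  = as.Date(c("2020-01-01", "2020-01-10", "2020-01-05"))
)
lookup <- data.table(
  key_a = c("green", "green", "blue", "blue"),
  date  = as.Date(c("2020-01-02", "2020-01-20", "2020-01-01", "2020-01-06")),
  key_z = c(11L, 12L, 21L, 22L)
)

# Data quality check: no duplicates on the join columns combined
stopifnot(anyDuplicated(lookup, by = c("key_a", "date")) == 0L)

# Keep a copy of the lookup date, since the join column in the result
# carries the date from `events`
lookup[, lookup_date := date]

# Exact match on key_a, nearest match on date (the last column in `on`)
nearest <- lookup[events, on = c("key_a", "date"), roll = "nearest"]
print(nearest)
```

With unique keys in `lookup`, the result has exactly one row per row of `events`, so the row explosion described above cannot occur.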

answered Oct 24 '22 20:10

jangorecki
