Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are helpful optimizations in R for big data sets?

I built a script that works great with small data sets (<1 M rows) and performs very poorly with large datasets. I've heard of data table as being more performant than tibbles. I'm interested to know about other speed optimizations in addition to learn about data tables.

I'll share a couple of commands in the script for examples. In each of the examples, the datasets are 10 to 15 million rows and 10 to 15 columns.

  1. Getting the lowest date for a dataframe grouped by nine variables
      dataframe %>% 
      group_by(key_a, key_b, key_c,
               key_d, key_e, key_f,
               key_g, key_h, key_i) %>%
      summarize(min_date = min(date)) %>% 
      ungroup()
  1. Doing a left join on two dataframes to add an additional column
      merge(dataframe, 
          dataframe_two, 
          by = c("key_a", "key_b", "key_c",
               "key_d", "key_e", "key_f",
               "key_g", "key_h", "key_i"),
          all.x = T) %>% 
      as_tibble()
  1. Joining two dataframes on the closest date
      dataframe %>%
      left_join(dataframe_two, 
                  by = "key_a") %>%
      group_by(key_a, date.x) %>%
      summarise(key_z = key_z[which.min(abs(date.x - date.y))]) %>%
      arrange(date.x) %>%
      rename(day = date.x)

What best practices can I apply and, in particular, what can I do to make these types of functions optimized for large datasets?

--

This is an example dataset

set.seed(1010)
library("conflicted")
conflict_prefer("days", "lubridate")
bigint <- rep(
  sample(1238794320934:19082323109, 1*10^7)
)

key_a <-
  rep(c("green", "blue", "orange"), 1*10^7/2)

key_b <-
  rep(c("yellow", "purple", "red"), 1*10^7/2)

key_c <-
  rep(c("hazel", "pink", "lilac"), 1*10^7/2)

key_d <-
  rep(c("A", "B", "C"), 1*10^7/2)

key_e <-
  rep(c("D", "E", "F", "G", "H", "I"), 1*10^7/5)

key_f <-
  rep(c("Z", "M", "Q", "T", "X", "B"), 1*10^7/5)

key_g <-
  rep(c("Z", "M", "Q", "T", "X", "B"), 1*10^7/5)

key_h <-
  rep(c("tree", "plant", "animal", "forest"), 1*10^7/3)

key_i <-
  rep(c("up", "up", "left", "left", "right", "right"), 1*10^7/5)

sequence <- 
  seq(ymd("2010-01-01"), ymd("2020-01-01"), by = "1 day")

date_sequence <-
  rep(sequence, 1*10^7/(length(sequence) - 1))

dataframe <-
  data.frame(
    bigint,
    date = date_sequence[1:(1*10^7)],
    key_a = key_a[1:(1*10^7)],
    key_b = key_b[1:(1*10^7)],
    key_c = key_c[1:(1*10^7)],
    key_d = key_d[1:(1*10^7)],
    key_e = key_e[1:(1*10^7)],
    key_f = key_f[1:(1*10^7)],
    key_g = key_g[1:(1*10^7)],
    key_h = key_h[1:(1*10^7)],
    key_i = key_i[1:(1*10^7)]
  )

dataframe_two <-
  dataframe %>%
      mutate(date_sequence = ymd(date_sequence) + days(1))

sequence_sixdays <-
  seq(ymd("2010-01-01"), ymd("2020-01-01"), by = "6 days")

date_sequence <-
  rep(sequence_sixdays, 3*10^6/(length(sequence_sixdays) - 1))

key_z <-
  sample(1:10000000, 3*10^6)

dataframe_three <-
  data.frame(
    key_a = sample(key_a, 3*10^6),
    date = date_sequence[1:(3*10^6)],
    key_z = key_z[1:(3*10^6)]
  )
like image 670
Cauder Avatar asked Sep 07 '20 09:09

Cauder


People also ask

How do you handle a large data set in R?

There are two options to process very large data sets ( > 10GB) in R. Use integrated environment packages like Rhipe to leverage Hadoop MapReduce framework. Use RHadoop directly on hadoop distributed system.

Why is data table faster than Dplyr?

table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compares pandas .

How would you deal with a dataset that was too large to load into memory?

Money-costing solution: One possible solution is to buy a new computer with a more robust CPU and larger RAM that is capable of handling the entire dataset. Or, rent a cloud or a virtual memory and then create some clustering arrangement to handle the workload.


1 Answers

What best practices can I apply and, in particular, what can I do to make these types of functions optimized for large datasets?

use data.table package

library(data.table)
d1 = as.data.table(dataframe)
d2 = as.data.table(dataframe_two)

1

grouping by many columns is something that data.table is excellent at
see barchart at the very bottom of the second plot for comparison against dplyr spark and others for exactly this kind of grouping
https://h2oai.github.io/db-benchmark

by_cols = paste("key", c("a","b","c","d","e","f","g","h","i"), sep="_")
a1 = d1[, .(min_date = min(date_sequence)), by=by_cols]

note I changed date to date_sequence, I think you meant that as a column name

2

it is unclear on what fields you want to merge tables, dataframe_two does not have specified fields so the query is invalid
please clarify

3

data.table has very useful type of join called rolling join, which does exactly what you need

a3 = d2[d1, on=c("key_a","date_sequence"), roll="nearest"]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin || #!anyDuplicated(f__,  : 
#  Join results in more than 2^31 rows (internal vecseq reached #physical limit). Very likely misspecified join. Check for #duplicate key values in i each of which join to the same group in #x over and over again. If that's ok, try by=.EACHI to run j for #each group to avoid the large allocation. Otherwise, please search #for this error message in the FAQ, Wiki, Stack Overflow and #data.table issue tracker for advice.

It results an error. Error is in fact very useful. On your real data it may work perfectly fine, as the reason behind the error (cardinality of matching rows) may be related to process of generating sample data. It is very tricky to have good dummy data for joining. If you are getting the same error on your real data you may want to review design of that query as it attempts to make row explosion by doing many-to-many join. Even after already considering only single date_sequence identity (taking roll into account). I don't see this kind of question to be valid for that data (cadrinalities of join fields strictly speaking). You may want to introduce data quality checks layer in your workflow to ensure there are no duplicates on key_a and date_sequence combined.

like image 108
jangorecki Avatar answered Oct 24 '22 20:10

jangorecki