
Why is split inefficient on large data frames with many groups?

Tags: performance, r, purrr

df %>% split(.$x)

becomes slow for a large number of unique values of x. If we instead split the data frame manually into smaller subsets and then call split on each subset, the time drops substantially: roughly 5x with 10 subsets and 19x with 100 subsets.

library(dplyr)
library(microbenchmark)
library(caret)
library(purrr)

N      <- 10^6
groups <- 10^5
df     <- data.frame(x = sample(1:groups, N, replace = TRUE), 
                     y = sample(letters,  N, replace = TRUE))
ids      <- df$x %>% unique
folds10  <- createFolds(ids, 10)
folds100 <- createFolds(ids, 100)

Running microbenchmark gives us

## Unit: seconds

## expr                                                  mean
l1 <- df %>% split(.$x)                                # 242.11805

l2 <- lapply(folds10,  function(id) df %>% 
      filter(x %in% id) %>% split(.$x)) %>% flatten    # 50.45156  

l3 <- lapply(folds100, function(id) df %>% 
      filter(x %in% id) %>% split(.$x)) %>% flatten    # 12.83866

Is split not designed for a large number of groups? Are there any alternatives besides the manual initial subsetting?

My laptop is a late-2013 MacBook Pro (2.4 GHz, 8 GB RAM).


asked Sep 17 '16 09:09

Rickard

1 Answer

This is more an explanation than an answer. Subsetting a large data.frame is more costly than subsetting a small one:

> df100 = df[1:100,]
> idx = c(1, 10, 20)
> microbenchmark(df[idx,], df100[idx,], times=10)
Unit: microseconds
         expr     min      lq     mean  median      uq     max neval
    df[idx, ] 428.921 441.217 445.3281 442.893 448.022 475.364    10
 df100[idx, ]  32.082  32.307  35.2815  34.935  37.107  42.199    10

split() pays this cost for each group.
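
As a rough sketch of why the cost is paid per group: base R's split method for data frames works, in essence, by splitting the row indices and then subsetting the data frame once per group. The small data frame and the names ind_by_group and manual_split below are illustrative assumptions, not part of the question.

```r
# Small illustrative data frame (hypothetical; not the 10^6-row df above).
small <- data.frame(x = c(2L, 1L, 2L, 3L, 1L),
                    y = c("a", "b", "c", "d", "e"),
                    stringsAsFactors = FALSE)

# Step 1: split the row indices, which is a cheap vector split.
ind_by_group <- split(seq_len(nrow(small)), small$x)

# Step 2: subset the full data frame once per group.
# Each small[ind, ] call goes through `[.data.frame`, so its overhead
# is paid as many times as there are groups.
manual_split <- lapply(ind_by_group, function(ind) small[ind, , drop = FALSE])
```

With 10^5 groups, step 2 means 10^5 separate data frame subsets, each carrying the per-call overhead measured above.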

The reason can be seen by running Rprof():

> Rprof(); for (i in 1:1000) df[idx,]; Rprof(NULL); summaryRprof()
$by.self
       self.time self.pct total.time total.pct
"attr"      1.26      100       1.26       100

$by.total
               total.time total.pct self.time self.pct
"attr"               1.26       100      1.26      100
"[.data.frame"       1.26       100      0.00        0
"["                  1.26       100      0.00        0

$sample.interval
[1] 0.02

$sampling.time
[1] 1.26

All of the time is spent in a call to attr(). Stepping through the code using debug("[.data.frame") shows that the pain involves a call like

attr(df, "row.names")

This small example shows a trick that R uses to avoid representing row names that are not present: use c(NA, -5L), rather than 1:5.

> dput(data.frame(x=1:5))
structure(list(x = 1:5), .Names = "x", row.names = c(NA, -5L), class = "data.frame")

Note that attr() returns a full vector: the row.names are created on the fly, so for a large data.frame a large number of row.names are created each time it is called.

> attr(data.frame(x=1:5), "row.names")
[1] 1 2 3 4 5
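
One way to see the difference between the stored and the expanded form is base R's .row_names_info(), which can return the internal attribute without expanding it. This is only a small sketch; the variable names are illustrative.

```r
d <- data.frame(x = 1:5)

# Internal (compact) representation: a length-2 integer vector, c(NA, -5L).
compact <- .row_names_info(d, type = 0L)

# attr() expands the compact form into one entry per row.
expanded <- attr(d, "row.names")

# With the default type = 1L, a negative count signals "automatic" row names.
n_auto <- .row_names_info(d)

length(compact)
length(expanded)
n_auto
```

The compact form stays at length 2 however many rows there are, whereas the expanded vector grows with the data frame.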

So one might expect that even nonsensical row.names would speed up the calculation:

> dfns = df; rownames(dfns) = rev(seq_len(nrow(dfns)))
> system.time(split(dfns, dfns$x))
   user  system elapsed 
  4.048   0.000   4.048 
> system.time(split(df, df$x))
   user  system elapsed 
 87.772  16.312 104.100

Splitting a vector or matrix would also be fast.
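
A minimal sketch of the vector route, assuming it is acceptable to work with per-column lists rather than a list of data frames. The data frame small and the result names are hypothetical.

```r
set.seed(1)
small <- data.frame(x = sample(1:3, 10, replace = TRUE),
                    y = sample(letters, 10, replace = TRUE),
                    stringsAsFactors = FALSE)

# Split one column as a plain vector; no data frame subsetting is involved.
y_by_group <- split(small$y, small$x)

# Split every column the same way: a list (one entry per column)
# of lists (one entry per group).
cols_by_group <- lapply(small, split, f = small$x)
```

Whether this layout is convenient depends on what is done with the groups afterwards. It avoids `[.data.frame` entirely, but it gives up the one-data-frame-per-group structure.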

answered Sep 20 '22 15:09

Martin Morgan
