df %>% split(.$x)
becomes slow for a large number of unique values of x. If we instead split the data frame manually into smaller subsets and then call split on each subset, we reduce the time by at least an order of magnitude.
library(dplyr)
library(microbenchmark)
library(caret)
library(purrr)
N <- 10^6
groups <- 10^5
df <- data.frame(x = sample(1:groups, N, replace = TRUE),
y = sample(letters, N, replace = TRUE))
ids <- df$x %>% unique
folds10 <- createFolds(ids, 10)
folds100 <- createFolds(ids, 100)
Running microbenchmark on the three approaches

l1 <- df %>% split(.$x)
l2 <- lapply(folds10,  function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten
l3 <- lapply(folds100, function(id) df %>% filter(x %in% id) %>% split(.$x)) %>% flatten

gives us

## Unit: seconds
##  expr      mean
##    l1 242.11805
##    l2  50.45156
##    l3  12.83866
Is split() not designed for a large number of groups? Are there any alternatives besides the manual initial subsetting?
My laptop is a MacBook Pro (late 2013), 2.4 GHz, 8 GB RAM.
You can also do the following: split(x = df, f = ~ var1 + var2 ...). This way you can split the data frame by several variables without passing a list to the f argument.
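For example, a minimal sketch with the df from the question (as far as I know, this formula interface for splitting a data frame is only available in more recent versions of R):

by_x <- split(df, ~ x)   # same grouping as split(df, df$x)
length(by_x)             # one element per unique value of x

Note that this is an interface convenience; as far as I can tell it still pays the same per-group cost discussed in the answer below, so it does not by itself make the split faster.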
More an explanation than an answer: subsetting a large data.frame is more costly than subsetting a small data.frame.
> df100 = df[1:100,]
> idx = c(1, 10, 20)
> microbenchmark(df[idx,], df100[idx,], times=10)
Unit: microseconds
         expr     min      lq     mean  median      uq     max neval
    df[idx, ] 428.921 441.217 445.3281 442.893 448.022 475.364    10
 df100[idx, ]  32.082  32.307  35.2815  34.935  37.107  42.199    10
split() pays this cost for each group.
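As a rough sanity check (my own arithmetic, based on the ~440 microsecond figure measured above and the 10^5 groups in the question):

# ~440 microseconds per subset of the full df, repeated once per group
1e5 * 440e-6
## [1] 44   # on the order of 44 seconds spent just on subsetting

which is a sizeable share of the split() times reported above.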
The reason can be seen by running Rprof()
> Rprof(); for (i in 1:1000) df[idx,]; Rprof(NULL); summaryRprof()
$by.self
       self.time self.pct total.time total.pct
"attr"      1.26      100       1.26       100

$by.total
               total.time total.pct self.time self.pct
"attr"               1.26       100      1.26      100
"[.data.frame"       1.26       100      0.00        0
"["                  1.26       100      0.00        0
$sample.interval
[1] 0.02
$sampling.time
[1] 1.26
All of the time is being spent in a call to attr(). Stepping through the code using debug("[.data.frame") shows that the pain involves a call like

attr(df, "row.names")

This small example shows a trick that R uses to avoid representing row names that are not present: use c(NA, -5L) rather than 1:5.
> dput(data.frame(x=1:5))
structure(list(x = 1:5), .Names = "x", row.names = c(NA, -5L), class = "data.frame")
Note that attr() returns a vector: the row.names are created on the fly, and for a large data.frame a large number of row names are created.
> attr(data.frame(x=1:5), "row.names")
[1] 1 2 3 4 5
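(As an aside, and an assumption of mine rather than something from the original answer: base R's .row_names_info() reports on the compact form without expanding it.)

> .row_names_info(data.frame(x = 1:5))      # negative: compact 'automatic' row names
[1] -5
> .row_names_info(data.frame(x = 1:5), 2L)  # the implied number of rows
[1] 5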
So one might expect that even nonsensical (but explicitly stored) row.names would speed up the calculation:
> dfns = df; rownames(dfns) = rev(seq_len(nrow(dfns)))
> system.time(split(dfns, dfns$x))
user system elapsed
4.048 0.000 4.048
> system.time(split(df, df$x))
user system elapsed
87.772 16.312 104.100
Splitting a vector or matrix would also be fast.
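As for alternatives to the manual pre-subsetting, one sketch along those lines (my own suggestion, not part of the original answer): split an integer index vector rather than the data.frame, and subset a group only when it is needed.

# Splitting a plain integer vector avoids calling "[.data.frame" per group.
idx <- split(seq_len(nrow(df)), df$x)
# Materialise a single group lazily when required:
one_group <- df[idx[[1]], ]

Splitting the indices is fast because it never touches the data.frame subsetting code; the per-group "[.data.frame" cost is only paid for the groups you actually extract.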