I am trying to get a function similar to expand.grid
and works on data.frame
.
I found a solution in Alternative to expand.grid for data.frames
that uses merge
function to implement this.
Since merge
is quite slow compared to dplyr
alternative full_join
, so I try to use full_join
to implement this function, but I couldn't get it done correctly. Here is an example I failed:
df <- data.frame(attribute = paste0('attr', rep(1:5, each=2)),
value = paste0(rep(1:5, each=2), rep(c('A','B'), 2)),
score = runif(10))
df
attribute value score
1 attr1 1A 0.75600171
2 attr1 1B 0.07086242
3 attr2 2A 0.92403325
4 attr2 2B 0.63414169
5 attr3 3A 0.78763834
6 attr3 3B 0.88576568
7 attr4 4A 0.75998967
8 attr4 4B 0.25205845
9 attr5 5A 0.99304728
10 attr5 5B 0.70389605
I tried to split df
by attribute
and join the list of score together:
dfList <- df %>%
mutate(attribute=1) %>%
split(df$attribute)
And I "expand.grid" all these 5 tables together by:
Reduce(function(x, y) {full_join(x, y, by=c('attribute'='attribute'))}, dfList)
However, the result is weird:
attribute value.x score.x value.y score.y value.x score.x value.y score.y value score
1 1 1A 0.75600171 2A 0.9240333 1A 0.75600171 2A 0.9240333 5A 0.9930473
2 1 1A 0.75600171 2A 0.9240333 1A 0.75600171 2A 0.9240333 5B 0.7038961
3 1 1A 0.75600171 2A 0.9240333 1A 0.75600171 2A 0.9240333 5A 0.9930473
4 1 1A 0.75600171 2A 0.9240333 1A 0.75600171 2A 0.9240333 5B 0.7038961
...
The first 2 tables are shown twice, which is not desired. But when I try this on the first 4 tables it works perfectly:
Reduce(function(x, y) {full_join(x, y, by=c('attribute'='attribute'))}, dfList[1:4])
attribute value.x score.x value.y score.y value.x score.x value.y score.y
1 1 1A 0.75600171 2A 0.9240333 3A 0.7876383 4A 0.7599897
2 1 1A 0.75600171 2A 0.9240333 3A 0.7876383 4B 0.2520584
3 1 1A 0.75600171 2A 0.9240333 3B 0.8857657 4A 0.7599897
4 1 1A 0.75600171 2A 0.9240333 3B 0.8857657 4B 0.2520584
...
Where I did wrong?
I'm using dplyr
0.4.3 with R version 3.2.4
on Ubuntu 14.04
I can reproduce the corrupted result for your dfList
on my machine. It seems to me I've found out why it's happening.
require(dplyr)
adf <- data.frame(c1 = 7, c1 = 8, jv = 1, check.names = F)
bdf <- data.frame(d1 = 1:3, d2 = letters[1:3], jv = 1)
cdf <- data.frame(v1.x = 1:3, v2 = letters[1:3], jv = 1)
ddf <- data.frame(v2 = 4:5, v2.x = letters[4:5], jv = 1)
full_join(adf, bdf, by = "jv")
c1 c1 jv d1 d2
1 7 7 1 1 a
2 7 7 1 2 b
3 7 7 1 3 c
We can notice that having duplicated column names in adf
leads to the wrong result of joining. And when we apply a chain of several joins with the help of Reduce
, the automatic renaming of duplicated column names happens (with adding .x
and .y
by default). This may lead to producing another duplicated names (the opposite to the thing it's intended to avoid).
full_join(cdf, ddf, by = "jv")
v1.x v2.x jv v2.y v2.x
1 1 a 1 4 d
2 1 a 1 5 e
3 2 b 1 4 d
4 2 b 1 5 e
5 3 c 1 4 d
6 3 c 1 5 e
Here we had one duplication of names in different data.frames
- column v2
, which was replaced with another duplication after applying suffixes - v2.x
.
So, to make things work well, we should care about unique names of columns in data.frame
s we're joining.
I've tried several approaches to get the desired result and want to present what they are.
merge
, it's made for speed comparison.full_join
from dplyr
packagedata.table
solution using sequential merge
of dt
stidyr
's unnest
data.table
solution which first generates the key-table of target resulting length (with the help of CJ
) and then makes several left joinson
parameter for joining instead of setting keysrequire(data.table)
require(dplyr)
require(tidyr)
require(stringi)
require(microbenchmark)
expand.grid.df_base <- function(...) {
dfList <- list(...)
if (length(dfList) == 1) dfList <- dfList[[1]]
if (is.null(names(dfList))) names(dfList) <- paste0("df", 1:length(dfList))
lapply(1:length(dfList), function(i)
data.frame(dfN = i, colN = 1:length(dfList[[i]]),
dfname = names(dfList)[i], colname = names(dfList[[i]]),
stringsAsFactors = F)) %>% bind_rows %>%
mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>%
mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names
for (i in 1:nrow(manage_names)) names(dfList[[manage_names$dfN[i]]])[manage_names$colN[i]] <- manage_names$dum_names[i]
Reduce(function(x, y) merge(x, y, by = NULL), dfList) %>% setNames(manage_names$out_names)
}
expand.grid.df_dplyr <- function(...) {
dfList <- list(...)
if (length(dfList) == 1) dfList <- dfList[[1]]
if (is.null(names(dfList))) names(dfList) <- paste0("df", 1:length(dfList))
lapply(1:length(dfList), function(i)
data.frame(dfN = i, colN = 1:length(dfList[[i]]),
dfname = names(dfList)[i], colname = names(dfList[[i]]),
stringsAsFactors = F)) %>% bind_rows %>%
mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>%
mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names
for (i in 1:nrow(manage_names)) names(dfList[[manage_names$dfN[i]]])[manage_names$colN[i]] <- manage_names$dum_names[i]
joinvar <- stri_rand_strings(1, 12)
Reduce(function(x, y) {
mutate_def <- list(1L)
names(mutate_def) <- joinvar
full_join(x %>% mutate_(.dots = mutate_def), y %>% mutate_(.dots = mutate_def), by = joinvar)
}, dfList) %>% select(-contains(joinvar)) %>% setNames(manage_names$out_names) %>% tbl_df
}
expand.grid.dt <- function(...) {
dtList <- list(...)
if (length(dtList) == 1) dtList <- dtList[[1]]
if (!all(sapply(dtList, is.data.table))) dtList <- lapply(dtList, as.data.table)
if (is.null(names(dtList))) setnames(dtList, paste0("dt", 1:length(dtList)))
lapply(1:length(dtList), function(i)
data.frame(dfN = i, colN = 1:length(dtList[[i]]),
dfname = names(dtList)[i], colname = names(dtList[[i]]),
stringsAsFactors = F)) %>% bind_rows %>%
mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>%
mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names
for (i in 1:nrow(manage_names)) setnames(dtList[[manage_names$dfN[i]]], old = manage_names$colN[i], new = manage_names$dum_names[i])
joinvar <- stri_rand_strings(1, 12)
setnames(Reduce(function(x, y) merge(copy(x)[,(joinvar) := 1], copy(y)[,(joinvar) := 1],
by = joinvar, all = T, allow.cartesian = T), dtList)[,(joinvar) := NULL],
manage_names$out_names)[]
}
expand.grid.df_tidyr <- function(...) {
dfList <- list(...)
if (length(dfList) == 1) dfList <- dfList[[1]]
if (is.null(names(dfList))) names(dfList) <- paste0("df", 1:length(dfList))
lapply(1:length(dfList), function(i)
data.frame(dfN = i, colN = 1:length(dfList[[i]]),
dfname = names(dfList)[i], colname = names(dfList[[i]]),
stringsAsFactors = F)) %>% bind_rows %>%
mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>%
mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names
for (i in 1:nrow(manage_names)) names(dfList[[manage_names$dfN[i]]])[manage_names$colN[i]] <- manage_names$dum_names[i]
Reduce(function(x, y) x %>% rowwise %>% mutate(dfcol = list(y)) %>% ungroup %>% unnest(dfcol), dfList) %>%
setNames(manage_names$out_names) %>% tbl_df
}
expand.grid.dt2 <- function(...) {
dtList <- list(...)
if (length(dtList) == 1) dtList <- dtList[[1]]
dum_names <- stri_rand_strings(length(dtList), 12)
dtList <- lapply(1:length(dtList), function(i)
setkeyv(as.data.table(dtList[[i]])[, (dum_names[i]) := .I], dum_names[i]))
Reduce(function(result, dt) setkeyv(result, names(result)[1])[dt][, (names(result)[1]) := NULL],
dtList,
setnames(do.call(CJ, c(sapply(dtList, function(df) seq_len(nrow(df))), list(sorted = F))), dum_names))[]
}
expand.grid.dt3 <- function(...) {
dtList <- list(...)
if (length(dtList) == 1) dtList <- dtList[[1]]
dum_names <- stri_rand_strings(length(dtList), 12)
dtList <- lapply(1:length(dtList), function(i) as.data.table(dtList[[i]])[, (dum_names[i]) := .I])
Reduce(function(result, dt) result[dt, on = names(result)[1]][, (names(result)[1]) := NULL],
dtList,
setnames(do.call(CJ, c(sapply(dtList, function(df) seq_len(nrow(df))), list(sorted = F))), dum_names))[]
}
Now lets create lists of data.frame
s for testing this functions.
set.seed(1)
bigdfList <- data.frame(type = sample(letters[1:10], 50, T),
categ = sample(LETTERS[1:10], 50, T),
num = sample(100L:500L, 50, T),
val = rnorm(50)) %>% split(., .$type)
smalldfList <- data.frame(type = sample(letters[1:5], 50, T),
categ = sample(LETTERS[1:5], 50, T),
num = sample(100L:500L, 50, T),
val = rnorm(50)) %>% split(., .$type)
The expand joinig of smalldfList
produces a table of dimension [60,480 x 20]
and of bigdfList
- [6,451,200 x 40]
which occupies 1230.5 MB of RAM.
Start with smalldfList
.
microbenchmark(expand.grid.df_base(smalldfList), expand.grid.df_dplyr(smalldfList),
expand.grid.dt(smalldfList), expand.grid.df_tidyr(smalldfList),
expand.grid.dt2(smalldfList), expand.grid.dt3(smalldfList), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
expand.grid.df_base(smalldfList) 178.36192 188.54955 201.28729 198.79644 209.86934 229.85360 10 b
expand.grid.df_dplyr(smalldfList) 16.04555 16.91327 18.91094 17.64907 18.45307 29.58192 10 a
expand.grid.dt(smalldfList) 20.33188 21.42275 26.30034 23.22873 31.66666 39.37922 10 a
expand.grid.df_tidyr(smalldfList) 722.06572 738.02188 801.41820 792.23725 859.96186 905.99190 10 c
expand.grid.dt2(smalldfList) 32.22650 33.68353 36.89386 36.39713 37.39182 48.93550 10 a
expand.grid.dt3(smalldfList) 29.13399 30.69299 34.51265 34.03198 37.48651 41.73543 10 a
So, tidyr
solution is not an option here at all, base merge
is also quite slow. Other 4 functions on the bigdfList
show following efficiency.
microbenchmark(expand.grid.df_dplyr(bigdfList), expand.grid.dt(bigdfList),
expand.grid.dt2(bigdfList), expand.grid.dt3(bigdfList), times = 10)
Unit: seconds
expr min lq mean median uq max neval cld
expand.grid.df_dplyr(bigdfList) 1.326336 1.354706 1.456805 1.449781 1.481836 1.703158 10 a
expand.grid.dt(bigdfList) 1.763174 1.820004 1.894813 1.893910 1.939879 2.127097 10 b
expand.grid.dt2(bigdfList) 14.164731 14.332872 14.452933 14.452221 14.551982 14.740852 10 d
expand.grid.dt3(bigdfList) 10.589517 10.828548 11.104010 11.021519 11.368172 11.976976 10 c
And the dplyr::full_join
solution has the best result!
Maybe, it's one of the options where dplyr
is really better than data.table
, maybe it's my lack of data.table
knowledge, which has prevented me from making a really fast function :-)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With