Is there a dplyr or data.table equivalent to plyr::join_all? Joining by a list of data frames?

Tags:

r

data.table

dplyr

plyr

Given this data.frame:

set.seed(4)
df <- data.frame(x = rep(1:5, each = 2), y = sample(50:100, 10, replace = TRUE))
#    x  y
# 1  1 78
# 2  1 53
# 3  2 93
# 4  2 96
# 5  3 61
# 6  3 82
# 7  4 53
# 8  4 76
# 9  5 91
# 10 5 99

I would like to write some simple functions (i.e. feature engineering) to create features for x and then join each of the resulting data.frames together. For example:

library(dplyr)
count_x <- function(df) df %>% group_by(x) %>% summarise(count_x = n())
sum_y   <- function(df) df %>% group_by(x) %>% summarise(sum_y = sum(y))
mean_y  <- function(df) df %>% group_by(x) %>% summarise(mean_y = mean(y))  
# and many more...

This can be accomplished with plyr::join_all, but I am wondering whether there is a better (or more performant) method with dplyr or data.table.

df_with_features <- plyr::join_all(list(count_x(df), sum_y(df), mean_y(df)),
                                   by = 'x', type = 'full')

# > df_with_features
#   x count_x sum_y mean_y
# 1 1       2   131   65.5
# 2 2       2   189   94.5
# 3 3       2   143   71.5
# 4 4       2   129   64.5
# 5 5       2   190   95.0

asked Nov 24 '15 13:11

JasonAizkalns

2 Answers

Combining @SimonOHanlon's data.table method with @Jaap's Reduce and merge techniques appears to yield the most performant results:

library(data.table)
setDT(df)
count_x_dt <- function(dt) dt[, list(count_x = .N), keyby = x]
sum_y_dt   <- function(dt) dt[, list(sum_y = sum(y)), keyby = x]
mean_y_dt  <- function(dt) dt[, list(mean_y = mean(y)), keyby = x]

Reduce(function(...) merge(..., all = TRUE, by = c("x")), 
       list(count_x_dt(df), sum_y_dt(df), mean_y_dt(df)))

Update: a tidyverse approach using purrr::reduce:

library(tidyverse)
list(count_x(df), sum_y(df), mean_y(df)) %>% 
  reduce(full_join, by = "x")  # full join on x, matching type = 'full' in the question
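
As the number of feature functions grows, the apply-then-join pattern above can be wrapped in a small helper. The following is only a minimal sketch in base R: `join_features`, the `*_base` functions, and `df_example` are hypothetical names introduced for illustration, and the data is typed out from the question's printed data frame so the result does not depend on the RNG version.

```r
# Hypothetical helper (not part of plyr, dplyr, or data.table):
# apply each feature function to `data`, then full-join the results on `by`.
join_features <- function(data, feature_fns, by = "x") {
  pieces <- lapply(feature_fns, function(f) f(data))
  Reduce(function(a, b) merge(a, b, by = by, all = TRUE), pieces)
}

# Base-R stand-ins for the feature functions, so the sketch has no dependencies.
count_x_base <- function(d) aggregate(list(count_x = d$y), by = list(x = d$x), FUN = length)
sum_y_base   <- function(d) aggregate(list(sum_y = d$y), by = list(x = d$x), FUN = sum)
mean_y_base  <- function(d) aggregate(list(mean_y = d$y), by = list(x = d$x), FUN = mean)

# The question's data, typed out from the printed data frame.
df_example <- data.frame(
  x = rep(1:5, each = 2),
  y = c(78, 53, 93, 96, 61, 82, 53, 76, 91, 99)
)

features <- join_features(df_example, list(count_x_base, sum_y_base, mean_y_base))
print(features)
```

Because `merge` is generic, the same helper should also accept the data.table or dplyr feature functions defined earlier; only the join step is fixed to a full join on `by`.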

answered Nov 08 '22 03:11

JasonAizkalns

In data.table parlance, the equivalent is to key each data.table on the join column (which also sorts it) and then use that key to join the various data.tables.

For example:

library(data.table)
setDT(df)  # df is now a data.table (converted by reference)
df_count <- df[, list(count_x = .N), keyby = x]
df_sum   <- df[, list(sum_y = sum(y)), keyby = x]
# keyby sets x as the key; merge.data.table executes a fast join on the shared key
merge(df_count, df_sum)
#   x count_x sum_y
#1: 1       2   131
#2: 2       2   189
#3: 3       2   143
#4: 4       2   129
#5: 5       2   190

In your example you might write something like this:

count_x <- function(dt) dt[, list(N = .N), keyby = x]
sum_y   <- function(dt) dt[, list(Sum = sum(y)), keyby = x]

# Then merge...
merge(sum_y(df), count_x(df))
#   x Sum   N
#1: 1 131   2
#2: 2 189   2
#3: 3 143   2
#4: 4 129   2
#5: 5 190   2
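
To see why `keyby` rather than `by` matters for the key-based merge, the sketch below compares the two on a copy of the question's data. It assumes data.table is installed; `dt_example`, `by_only`, `key_by`, `sums`, and `merged` are illustrative names, not part of the original answer.

```r
library(data.table)

# The question's data, typed out from the printed data frame.
dt_example <- data.table(
  x = rep(1:5, each = 2),
  y = c(78, 53, 93, 96, 61, 82, 53, 76, 91, 99)
)

by_only <- dt_example[, list(N = .N), by = x]
key_by  <- dt_example[, list(N = .N), keyby = x]

key(by_only)  # NULL: `by` groups without setting a key
key(key_by)   # "x": `keyby` also sets the key on the result

# Both inputs are keyed on x, so merge() needs no explicit `by` argument.
sums   <- dt_example[, list(Sum = sum(y)), keyby = x]
merged <- merge(sums, key_by)
print(merged)
```

Passing `by = "x"` to `merge` explicitly is a reasonable safeguard when it is unclear whether every input table carries the key.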

answered Nov 08 '22 05:11

Simon O'Hanlon
