Given this data.frame:
set.seed(4)
df <- data.frame(x = rep(1:5, each = 2), y = sample(50:100, 10, replace = TRUE))
#     x  y
# 1   1 78
# 2   1 53
# 3   2 93
# 4   2 96
# 5   3 61
# 6   3 82
# 7   4 53
# 8   4 76
# 9   5 91
# 10  5 99
I would like to write some simple functions (i.e. feature engineering) to create features for x and then join each of the resulting data.frames together. For example:
library(dplyr)
count_x <- function(df) df %>% group_by(x) %>% summarise(count_x = n())
sum_y <- function(df) df %>% group_by(x) %>% summarise(sum_y = sum(y))
mean_y <- function(df) df %>% group_by(x) %>% summarise(mean_y = mean(y))
# and many more...
This can be accomplished with plyr::join_all, but I am wondering if there is a better (or more performant) method with dplyr or data.table?
df_with_features <- plyr::join_all(list(count_x(df), sum_y(df), mean_y(df)),
                                   by = 'x', type = 'full')
# > df_with_features
#   x count_x sum_y mean_y
# 1 1       2   131   65.5
# 2 2       2   189   94.5
# 3 3       2   143   71.5
# 4 4       2   129   64.5
# 5 5       2   190   95.0
To combine data frames stored in a list in R, we can use the full_join function from the dplyr package inside the Reduce function.
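A minimal sketch of that idea, assuming the feature functions from the question. The y values are hard-coded to the ones printed in the question so the result does not depend on the R version's sampler.

```r
library(dplyr)

# Same data as printed in the question, hard-coded for reproducibility
df <- data.frame(x = rep(1:5, each = 2),
                 y = c(78, 53, 93, 96, 61, 82, 53, 76, 91, 99))

count_x <- function(df) df %>% group_by(x) %>% summarise(count_x = n())
sum_y   <- function(df) df %>% group_by(x) %>% summarise(sum_y = sum(y))
mean_y  <- function(df) df %>% group_by(x) %>% summarise(mean_y = mean(y))

# Reduce() folds the list pairwise: full_join(full_join(a, b), c)
features <- Reduce(function(a, b) full_join(a, b, by = "x"),
                   list(count_x(df), sum_y(df), mean_y(df)))
```

Passing by = "x" explicitly avoids the "Joining with by = ..." message and guards against accidental joins on other shared column names.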
To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x$a to y$b. To join by multiple variables, use a vector with length > 1. For example, by = c("a", "b") will match x$a to y$a and x$b to y$b.
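Both forms of by can be sketched as follows. The tables and column names here (left, right, l2, r2) are hypothetical and only serve to illustrate the matching rules.

```r
library(dplyr)

# Hypothetical tables whose key columns have different names
left  <- data.frame(a = 1:3, v = c("p", "q", "r"))
right <- data.frame(b = c(1L, 2L, 4L), w = c(10, 20, 40))

# Matches left$a to right$b; the key column keeps the left-hand name (a)
by_named <- full_join(left, right, by = c("a" = "b"))

# Hypothetical tables sharing two key columns
l2 <- data.frame(a = c(1, 1, 2), b = c("u", "v", "u"), v = 1:3)
r2 <- data.frame(a = c(1, 2), b = c("u", "u"), w = c(TRUE, FALSE))

# Rows match only when both a and b agree
by_multi <- inner_join(l2, r2, by = c("a", "b"))
```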
If the columns you want to join by don't have the same name, you need to tell merge which columns you want to join by: by.x for the x data frame column name, and by.y for the y one, such as merge(df1, df2, by.x = "df1ColName", by.
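As an illustration of by.x and by.y in base R, here is a small sketch. The data frames and the column names id1 and id2 are made up for this example.

```r
# Hypothetical data frames with differently named key columns
df1 <- data.frame(id1 = c(1, 2, 3), v = c("p", "q", "r"))
df2 <- data.frame(id2 = c(2, 3, 4), w = c(20, 30, 40))

# all = TRUE requests a full outer join; the key column takes the by.x name
merged <- merge(df1, df2, by.x = "id1", by.y = "id2", all = TRUE)
```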
Combining @SimonOHanlon's data.table method with @Jaap's Reduce and merge techniques appears to yield the most performant results:
library(data.table)
setDT(df)
count_x_dt <- function(dt) dt[, list(count_x = .N), keyby = x]
sum_y_dt <- function(dt) dt[, list(sum_y = sum(y)), keyby = x]
mean_y_dt <- function(dt) dt[, list(mean_y = mean(y)), keyby = x]
Reduce(function(...) merge(..., all = TRUE, by = "x"),
       list(count_x_dt(df), sum_y_dt(df), mean_y_dt(df)))
Updating to include a tidyverse / purrr (purrr::reduce) approach:
library(tidyverse)
list(count_x(df), sum_y(df), mean_y(df)) %>%
  reduce(left_join, by = "x")  # explicit key avoids the "Joining with by" message
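Since the question mentions "many more" feature functions, one possible extension is to keep the functions themselves in a list and apply them before reducing. This is only a sketch: feature_fns is a name introduced here, and the y values are hard-coded to the ones printed in the question.

```r
library(dplyr)
library(purrr)

# Same data as printed in the question, hard-coded for reproducibility
df <- data.frame(x = rep(1:5, each = 2),
                 y = c(78, 53, 93, 96, 61, 82, 53, 76, 91, 99))

count_x <- function(df) df %>% group_by(x) %>% summarise(count_x = n())
sum_y   <- function(df) df %>% group_by(x) %>% summarise(sum_y = sum(y))
mean_y  <- function(df) df %>% group_by(x) %>% summarise(mean_y = mean(y))

# Hypothetical registry of feature functions; append new ones here
feature_fns <- list(count_x, sum_y, mean_y)

features <- feature_fns %>%
  map(function(f) f(df)) %>%                           # one summary table per function
  reduce(function(a, b) full_join(a, b, by = "x"))     # join them pairwise on x
```

Adding a feature then only requires writing the function and appending it to the list; the join step stays unchanged.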
In data.table parlance this would be the equivalent of having a sorted keyed data.table and using the key to join the various data.tables, e.g.
library(data.table)
setDT(df) # df is now a data.table
# keyby (rather than by) sorts each result and sets x as its key
df_count <- df[, list(count_x = .N), keyby = x]
df_sum <- df[, list(sum_y = sum(y)), keyby = x]
# merge.data.table executes a fast join on the shared key
merge(df_count, df_sum)
#    x count_x sum_y
# 1: 1       2   131
# 2: 2       2   189
# 3: 3       2   143
# 4: 4       2   129
# 5: 5       2   190
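Joining on the key can also be written with data.table's X[Y] syntax instead of merge. The sketch below assumes both tables are keyed on x (via keyby) and uses the y values printed in the question; the names dt, dt_count and dt_sum are chosen here for illustration.

```r
library(data.table)

# Same data as printed in the question, hard-coded for reproducibility
dt <- data.table(x = rep(1:5, each = 2),
                 y = c(78, 53, 93, 96, 61, 82, 53, 76, 91, 99))

# keyby sorts by x and sets it as the key of each result
dt_count <- dt[, .(count_x = .N), keyby = x]
dt_sum   <- dt[, .(sum_y = sum(y)), keyby = x]

# X[Y]: look up each row of dt_sum in dt_count using the keys
joined <- dt_count[dt_sum]

# Equivalent join that names the column explicitly instead of relying on keys
joined_on <- dt_count[dt_sum, on = "x"]
```

Note that X[Y] keeps every row of Y (a right join), whereas merge defaults to an inner join; here the two agree because both tables contain the same x values.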
In your example you might write something like this:
count_x <- function(dt) dt[, list(N = .N), keyby = x]
sum_y <- function(dt) dt[, list(Sum = sum(y)), keyby = x]
# Then merge...
merge(sum_y(df), count_x(df))
#    x Sum N
# 1: 1 131 2
# 2: 2 189 2
# 3: 3 143 2
# 4: 4 129 2
# 5: 5 190 2