
R: Quickly Performing Operations on Subsets of a Data Frame, then Re-aggregating the Result Without an Inner Function

Tags: r, dplyr

We have a very large data frame df that can be split by factors. On each subset of the data frame created by this split, we need to perform an operation to increase the number of rows of that subset until it's a certain length. Afterwards, we rbind the subsets to get a bigger version of df.

Is there a way of doing this quickly without using an inner function?

Let's say our subset operation (in a separate .R file) is:

foo <- function(df) { magic }

We've come up with a few ways of doing this:

1)

df <- split(df, factor)
df <- lapply(df, foo)
rbindlist(df)

2)

assign('list.df', list(), envir=.GlobalEnv) 
assign('i', 1, envir=.GlobalEnv)

df <- dplyr::group_by(df, factor)
dplyr::mutate(df, foo.list(df.col))
df <- rbindlist(list.df)
rm('list.df', envir=.GlobalEnv)
rm('i', envir=.GlobalEnv)

(In a separate file)
foo.list <- function(df.cols) {
    magic; 
    list.df[[i]] <<- magic.df
    i <<- i + 1
    return(dummy)
}

The issue with the first approach is time. The lapply call simply takes too long (on the order of an hour with our data set).

The issue with the second approach is the extremely undesirable side-effect of tampering with the user's global environment. It's significantly faster, but this is something we'd rather avoid if we can.

We've also tried passing in the list and count variables and then substituting them with the variables in the parent environment (a sort of hack to get around R's lack of pass-by-reference).
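One way to get reference-like behaviour without touching the global environment is to keep the accumulator in a private environment, since environments in R are modified in place. The sketch below is only an illustration under that assumption: `make_collector` and the tiny `df` are hypothetical and not part of our actual code.

```r
# Hypothetical helper (not from the original post): keeps the accumulated
# pieces in a private environment instead of .GlobalEnv.
make_collector <- function() {
  store <- new.env(parent = emptyenv())
  store$items <- list()
  list(
    add = function(piece) {
      # Environments are modified in place, so the change persists across calls.
      store$items[[length(store$items) + 1L]] <- piece
      invisible(NULL)
    },
    result = function() do.call(rbind, store$items)
  )
}

# Made-up stand-in data with the same column names as the sample below.
df <- data.frame(factors = c(0, 0, 1), x = c(1.5, 0.2, 0.7), y = c(1L, 3L, 6L))

collector <- make_collector()
for (piece in split(df, df$factors)) {
  collector$add(piece)  # the real per-group expansion would go here
}
combined <- collector$result()
```

Nothing is created in or removed from `.GlobalEnv` apart from the objects the caller assigns explicitly, so there is no cleanup step with `rm()`.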

We've looked at a number of possibly relevant SO questions (R applying a function to a subset of a data frame, Calculations on subsets of a data frame, R: Pass by reference, etc.), but none of them really addressed our question.

If you want to run code, here's something you can copy and paste:

 x <- runif(n=10, min=0, max=3)
 y <- sample(x=10, replace=FALSE)
 factors <- runif(n=10, min=0, max=2)
 factors <- floor(factors)
 df <- data.frame(factors, x, y)

df now looks like this (10 rows): [image: Original df]

 ## We group by factor, then run foo on the groups.

 foo <- function(df.subset) {
   min <- min(df.subset$y)
   max <- max(df.subset$y)

   ## We fill out df.subset to have everything between the min and
   ## max values of y. Then we assign the old values of df.subset
   ## to the corresponding spots.

   df.fill <- data.frame(x=rep(0, max-min+1),
                         y=min:max,
                         factors=rep(df.subset$factors[1], max-min+1))
   df.fill$x[match(df.subset$y, df.fill$y)] <- df.subset$x
   df.fill
 }

Running the first approach on this sample data builds a new df (18 rows): [image: New df]
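For reference, the first approach can be run end to end roughly as follows. This is a sketch on made-up, deterministic data rather than the random sample above, and the `foo` here is a compact restatement that places the old values with `match()`, which is an assumption about the intended behaviour.

```r
library(data.table)  # for rbindlist()

# Small deterministic stand-in for the random sample data (values are made up).
df <- data.frame(factors = c(0, 0, 0, 1, 1),
                 x       = c(1.5, 0.2, 2.1, 0.7, 1.1),
                 y       = c(1L, 4L, 2L, 6L, 8L))

# Fill each group out to every integer y between its min and max.
foo <- function(df.subset) {
  lo <- min(df.subset$y)
  hi <- max(df.subset$y)
  df.fill <- data.frame(factors = df.subset$factors[1],
                        x = 0,
                        y = lo:hi)
  # match() finds the row of df.fill that holds each original y value.
  df.fill$x[match(df.subset$y, df.fill$y)] <- df.subset$x
  df.fill
}

pieces <- lapply(split(df, df$factors), foo)  # split, then apply
bigger <- rbindlist(pieces)                   # combine
```

Group 0 spans y = 1 to 4 and group 1 spans y = 6 to 8, so `bigger` has 7 rows, with x left at 0 wherever no original row existed.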

Ryan K. asked Dec 19 '22 20:12


2 Answers

With data.table, the grouped operation runs quickly. If you can, rewrite your function to work on specific columns rather than on a whole data frame; the split-apply-combine step may then get a further performance boost:

library(data.table)
system.time(
df2 <- setDT(df)[, foo(.SD), by = factors]
)
#   user  system elapsed 
#   1.63    0.39    2.03
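As an illustration of what "work with specific variables" might look like, the sketch below passes only the needed columns to a per-group function and returns a list of columns. The function name `fill_range` and the data are hypothetical; this is one possible reading of the advice, not the answerer's code.

```r
library(data.table)

# Made-up deterministic data with the same column names as the question.
dt <- data.table(factors = c(0, 0, 0, 1, 1),
                 x       = c(1.5, 0.2, 2.1, 0.7, 1.1),
                 y       = c(1L, 4L, 2L, 6L, 8L))

# Hypothetical per-group function: takes plain vectors, returns a list of
# equal-length columns, which data.table binds together across groups.
fill_range <- function(x, y) {
  y_all <- min(y):max(y)
  x_all <- numeric(length(y_all))   # zeros by default
  x_all[match(y, y_all)] <- x       # put the old x values at their y positions
  list(x = x_all, y = y_all)
}

filled <- dt[, fill_range(x, y), by = factors]
```

Working on vectors avoids building an intermediate data.frame for every group, which is usually where much of the per-group overhead goes.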
Pierre L answered May 24 '23 02:05


Another variation using data.table: first build the min(y):max(y) range for each group, then fill in x with a join + update:

require(data.table)
ans = setDT(df)[, .(x=0, y=min(y):max(y)), by=factors
              ][df, x := i.x, on=c("factors", "y")][]
ans
#     factors          x  y
#  1:       0 1.25104362  1
#  2:       0 0.16729068  2
#  3:       0 0.00000000  3
#  4:       0 0.02533907  4
#  5:       0 0.00000000  5
#  6:       0 0.00000000  6
#  7:       0 1.80547980  7
#  8:       1 0.34043937  3
#  9:       1 0.00000000  4
# 10:       1 1.51742163  5
# 11:       1 0.15709287  6
# 12:       1 0.00000000  7
# 13:       1 1.26282241  8
# 14:       1 2.88292354  9
# 15:       1 1.78573288 10
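The two steps can also be written separately, which may make the join + update easier to follow. This is a sketch on made-up deterministic data (not the random sample from the question); the name `grid` is just an illustrative choice.

```r
library(data.table)

dt <- data.table(factors = c(0, 0, 0, 1, 1),
                 x       = c(1.5, 0.2, 2.1, 0.7, 1.1),
                 y       = c(1L, 4L, 2L, 6L, 8L))

# Step 1: one row per integer y in each group's range, with x defaulting to 0.
grid <- dt[, .(x = 0, y = min(y):max(y)), by = factors]

# Step 2: update join. For grid rows that match a row of dt on (factors, y),
# overwrite x by reference with dt's value (i.x refers to the x column of dt).
grid[dt, x := i.x, on = c("factors", "y")]
```

Rows of `grid` with no match in `dt` keep x = 0, so no per-group function call is needed at all.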
Arun answered May 24 '23 04:05