Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimization: splitting dataframe into a list of dataframes, transforming data per row

Preliminaries: this question is mostly of educational value, the actual task at hand is completed, even if the approach is not entirely optimal. My question is whether the code below can be optimized for speed and/or implemented more elegantly. Perhaps using additional packages, such as plyr or reshape. Run on the actual data it takes about 140 seconds, much higher than the simulated data, since some of the original rows contain nothing but NA, and additional checks have to be made. To compare, the simulated data are processed in about 30 seconds.

Conditions: the dataset contains 360 variables, 30 times the set of 12. Let's name them V1_1, V1_2... (first set), V2_1, V2_2 ... (second set) and so forth. Each set of 12 variables contains dichotomous (yes/no) responses, in practice corresponding to a career status. For instance: work (yes/no), study (yes/no) and so forth, in total 12 statuses, repeated 30 times.

Task: the task at hand is to recode each set of 12 dichotomous variables into a single variable with 12 response categories (e.g. work, study... ). Ultimately we should get 30 variables, each with 12 response categories.

Data: I cannot post the actual dataset, but here is a good simulated approximation:

randomRow <- function() {
  # make a row with a single 1 and some NA's
  sample(x=c(rep(0,9),1,NA,NA),size=12,replace=F) 
}

# create a data frame with 12 variables and 1500 cases
makeDf <- function() {
  data <- matrix(NA,ncol=12,nrow=1500)
  for (i in 1:1500) {
    data[i,] <- randomRow()
  }
  return(data)
}

mydata <- NULL

# combine 30 of these dataframes horizontally
for (i in 1:30) {
  mydata <- cbind(mydata,makeDf())
}
mydata <- as.data.frame(mydata) # example data ready

My solution:

# Divide the dataset into a list with 30 dataframes, each with 12 variables
S1 <- lapply(1:30,function(i) {
  Z <- rep(1:30,each=12) # define selection vector
  mydata[Z==i]           # use selection vector to get groups of variables (x12)
})

recodeDf <- function(df) {
  result <- as.numeric(apply(df,1,function(x) {
    if (any(!is.na(df))) which(x == 1) else NA # return the position of "1" per row
  }))                                          # the if/else check is for the real data
  return(result)
}
# Combine individual position vectors into a dataframe
final.df <- as.data.frame(do.call(cbind,lapply(S1,recodeDf)))

All in all, there is a double *apply function, one across the list, the other across the dataframe rows. This makes it a bit slow. Any suggestions? Thanks in advance.

like image 766
Maxim.K Avatar asked Apr 10 '13 18:04

Maxim.K


2 Answers

Here is an approach that is basically instantaneous. (system.time = 0.1 seconds)

se set. The columnMatch component will depend on your data, but if it is every 12 columns, then the following will work.

MYD <- data.table(mydata)
# a new data.table (changed to numeric : Arun)
newDT <- as.data.table(replicate(30, numeric(nrow(MYD)),simplify = FALSE))
# for each column, which values equal 1
whiches <- lapply(MYD, function(x) which(x == 1))
# create a list of column matches (those you wish to aggregate)
columnMatch <- split(names(mydata), rep(1:30,each = 12))
setattr(columnMatch, 'names', names(newDT))

# cycle through all new columns
# and assign the the rows in the new data.table
## Arun: had to generate numeric indices for 
## cycling through 1:12, 13:24 in whiches[[.]]. That was the problem.
for(jj in seq_along(columnMatch)) {
 for(ii in seq_along(columnMatch[[jj]])) {
  set(newDT, j = jj, i = whiches[[ii + 12 * (jj-1)]], value = ii)
 }
}

This would work just as well adding columns by reference to the original.

Note set works on data.frames as well....

like image 190
mnel Avatar answered Oct 08 '22 03:10

mnel


I really like @Arun's matrix multiplication idea. Interestingly, if you compiling R against some OpenBLAS libraries, you could get this to operate in parallel.

However, I wanted to provide you with another, perhaps slower than matrix multiplication, solution that uses your original pattern, but is much faster than your implementation:

# Match is usually faster than which, because it only returns the first match 
# (and therefore won't fail on multiple matches)
# It also neatly handles your *all NA* case
recodeDf2 <- function(df) apply(df,1,match,x=1) 
# You can split your data.frame by column with split.default
# (Using split on data.frame will split-by-row)
S2<-split.default(mydata,rep(1:30,each=12))
final.df2<-lapply(S2,recodeDf2)

If you had a very large data frame, and many processors, you may consider parallelizing this operation with:

library(parallel)
final.df2<-mclapply(S2,recodeDf2,mc.cores=numcores) 
# Where numcores is your number of processors.

Having read @Arun and @mnel, I learned a lot about how to improve this function, by avoiding the coercion to an array, by processing the data.frame by column instead of by row. I don't mean to "steal" an answer here; OP should consider switching the checkbox to @mnel's answer.

I wanted, however, to share a solution that doesn't use data.table, and avoids for. It is still, however, slower than @mnel's solution, albeit slightly.

nograpes2<-function(mydata) {
  test<-function(df) {
    l<-lapply(df,function(x) which(x==1))
    lens<-lapply(l,length)
    rep.int(seq.int(l),times=lens)[order(unlist(l))]
  }
  S2<-split.default(mydata,rep(1:30,each=12))
  data.frame(lapply(S2,test))
}

I would also like to add that @Aaron's approach, using which with arr.ind=TRUE would also be very fast and elegant, if mydata started out as a matrix, rather than a data.frame. Coercion to a matrix is slower than the rest of the function. If speed were an issue, it would be worth considering reading the data in as a matrix in the first place.

like image 31
nograpes Avatar answered Oct 08 '22 03:10

nograpes