Efficiently compute proportions of one data frame from another

I have this data.frame:

set.seed(1)
df <- cbind(matrix(rnorm(26,100),26,100),data.frame(id=LETTERS,parent.id=sample(letters[1:5],26,replace = T),stringsAsFactors = F))

Each row is 100 measurements from a certain subject (designated by id), which is associated with a parent ID (designated by parent.id). The relationship between parent.id and id is one-to-many.

I'm looking for a fast way to get, for each df$id, the fraction each of its 100 measurements makes up of the corresponding measurements of its parent.id. That is, for each id in df$id I want to divide each of its 100 measurements by the column-wise sum of the measurements across all df$id's that share its df$parent.id.
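As a toy illustration (hypothetical values, just to pin down the computation): if ids A and B share parent a, then A's fraction for a given measurement is A's value divided by the sum of A's and B's values for that measurement:

toy <- data.frame(id = c("A", "B"), parent.id = "a", m1 = c(2, 3))
toy$m1 / sum(toy$m1)  # A: 2/5 = 0.4, B: 3/5 = 0.6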

What I'm trying is:

library(dplyr)  # needed for %>%

# per-parent column sums
sum.df <- dplyr::select(df, -id) %>% dplyr::group_by(parent.id) %>% dplyr::summarise_all(sum)

# divide each id's measurements by its parent's sums, one id at a time
fraction.df <- do.call(rbind, lapply(df$id, function(i){
  pid <- dplyr::filter(df, id == i)$parent.id
  (dplyr::filter(df, id == i) %>% dplyr::select(-id, -parent.id)) /
    (dplyr::filter(sum.df, parent.id == pid) %>% dplyr::select(-parent.id))
}))

But at the real dimensions of my data (length(df$id) = 10,000, each with 1,024 measurements), this is not fast enough.

Any idea how to improve this, ideally using dplyr functions?
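For reference, a fully vectorized dplyr sketch (assuming dplyr >= 1.0.0 for across()) that avoids the per-id loop; unlike the code above, it keeps the id and parent.id columns and the original row order:

library(dplyr)
fraction.df2 <- df %>%
  group_by(parent.id) %>%
  # divide every numeric (measurement) column by its per-parent column sum
  mutate(across(where(is.numeric), ~ .x / sum(.x))) %>%
  ungroup()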

Asked by user1701545
2 Answers

Let's compare these options with microbenchmark, all using the corrected dataset definition from @Sathish's answer:

The OP's method:

Units: seconds
      min      lq     mean   median       uq      max neval
 1.423583 1.48449 1.602001 1.581978 1.670041 2.275105   100

@Sathish's method speeds it up by a factor of about 5. This is valuable, to be sure:

Units: milliseconds
      min      lq     mean   median       uq      max neval
 299.3581 334.787 388.5283 363.0363 398.6714 951.4654   100 

One possible base R implementation, shown below and built on a few principles of efficient R code, improves things by a factor of about 65 (about 24 milliseconds vs. 1,582 milliseconds):

Units: milliseconds
     min       lq     mean   median       uq      max neval
21.49046 22.59205 24.97197 23.81264 26.36277 34.72929   100

Here's the base R implementation. As is the case for the OP's implementation, the parent.id and id columns are not included in the resulting structure (here fractions). fractions is a matrix with rows ordered according to sort(interaction(df$id, df$parent.id, drop = TRUE)).

values <- df[1:100]                      # the 100 measurement columns
parents <- split(values, df$parent.id)   # one data frame per parent.id
# 100 x n_parents matrix of per-parent column sums
sums <- vapply(parents, colSums, numeric(100), USE.NAMES = FALSE)
fractions <- matrix(0, 26, 100)          # preallocate the result
f_count <- 0
for (p_count in seq_along(parents)){
  # work on an unnamed matrix: faster subsetting than a data frame
  parent <- as.matrix(parents[[p_count]])
  dimnames(parent) <- NULL
  n <- nrow(parent)
  for (p_row in seq_len(n)){
    fractions[(f_count + p_row), ] <- parent[p_row, ] / sums[, p_count]
  }
  f_count <- f_count + n
}

Note: there's still room for improvement; split() is not particularly efficient, as the sketch below suggests.
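For instance, a fully vectorized variant (my sketch, not part of the original answer) drops split() and both loops in favor of rowsum() and a single matrix division; note that its rows stay in the original df order rather than the parent-sorted order above:

m <- as.matrix(df[1:100])
sums2 <- rowsum(m, df$parent.id)          # per-parent column sums, one row per parent.id
fractions2 <- m / sums2[df$parent.id, ]   # expand the sums by row-name indexing, then divide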

Note 2: What "principles of efficient R code" were used?

  1. Get rid of names whenever you can
  2. It's faster to find things in a matrix than a data frame (see the sketch after this list)
  3. Don't be afraid of for loops for efficiency, provided you're not growing an object
  4. Prefer vapply to the other apply family functions.
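A minimal demonstration of point 2 (a sketch; absolute timings will vary by machine):

library(microbenchmark)
dfr <- as.data.frame(matrix(rnorm(2600), 26, 100))
mat <- as.matrix(dfr)
# single-element lookup: matrix indexing skips data frame method dispatch
microbenchmark(dfr[13, 50], mat[13, 50])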
Answered by De Novo

The problem with your data is that rnorm(26, 100) produces only 26 values (with mean 100), which get recycled to fill the 26 x 100 matrix, so every column is a duplicate of the others (each row is one value repeated 100 times). I changed the call slightly so the dataset contains 2,600 distinct values.

Data:

set.seed(1L)
df <- cbind(matrix(rnorm(2600), nrow = 26, ncol = 100),data.frame(id=LETTERS,parent.id=sample(letters[1:5],26,replace = T),stringsAsFactors = F))

Code:

library('data.table')
setDT(df)  # assign data.table class by reference

# compute sum for each `parent.id` for each column (100 columns)
sum_df <- df[, .SD, .SDcols = which(colnames(df) != 'id' )][, lapply(.SD, sum ), by = .(parent.id ) ] 

# get matching column-name orders for df and sum_df; mixedsort sorts
# '1','2',...,'100' numerically so the two vectors line up
no_pid_id_df  <- gtools::mixedsort( colnames(df)[ ! ( colnames(df) %in% c( 'id', 'parent.id' ) ) ] )
no_pid_sum_df <- gtools::mixedsort( colnames(sum_df)[ colnames(sum_df) != 'parent.id' ] )

# match the `parent.id` for each `id`, then divide its values by the
# corresponding row of `sum_df`
df[, .( props = { 
  pid <- parent.id
  unlist( .SD[, .SD, .SDcols = no_pid_id_df ] ) /
    unlist( sum_df[ parent.id == pid, ][, .SD, .SDcols = no_pid_sum_df ] )
  }, parent.id ), by = .(id)]

Output:

#       id       props parent.id
#    1:  A -0.95157186         e
#    2:  A  0.06105359         e
#    3:  A -0.42267771         e
#    4:  A -0.03376174         e
#    5:  A -0.16639600         e
# ---                         
# 2596:  Z  2.34696158         e
# 2597:  Z  0.23762369         e
# 2598:  Z  0.60068440         e
# 2599:  Z  0.14192337         e
# 2600:  Z  0.01292592         e

Benchmark (sathish() wraps this answer's solution and dan() the base R solution from the other answer; frank() is a third variant whose code is not shown in this excerpt):

library('microbenchmark')
microbenchmark( sathish(), frank(), dan())
# Unit: milliseconds
#     expr         min         lq       mean    median         uq       max neval cld
# sathish() 404.450219 413.456675 433.656279 420.46044 429.876085 593.44202   100   c
# frank()     2.035302   2.304547   2.707019   2.47257   2.622025  18.31409   100   a  
# dan()      17.396981  18.230982  19.316653  18.59737  19.700394  27.13146   100   b 
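For completeness, a fully vectorized data.table formulation in the same spirit (my own sketch, not necessarily what frank() contained) replaces the per-id grouping with a single grouped division:

cols <- setdiff(colnames(df), c('id', 'parent.id'))
prop_dt <- copy(df)  # keep the original data.table intact
# divide every measurement column by its per-parent column sum in one pass
prop_dt[, (cols) := lapply(.SD, function(x) x / sum(x)), by = parent.id, .SDcols = cols]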
Answered by Sathish


