I have this data.frame:
set.seed(1)
df <- cbind(matrix(rnorm(26,100),26,100),
            data.frame(id=LETTERS,parent.id=sample(letters[1:5],26,replace = T),stringsAsFactors = F))
Each row is 100 measurements from a certain subject (designated by id), which is associated with a parent ID (designated by parent.id). The relationship between parent.id and id is one-to-many.
I'm looking for a fast way to get, for each df$id, the fraction that each of its 100 measurements makes up out of the corresponding measurements of its parent.id. That is, for each id in df$id I want to divide each of its 100 measurements by the sum of that measurement across all df$id's which share its df$parent.id.
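To make the arithmetic concrete, here is a toy sketch (a hypothetical 3-id, 2-measurement version of the same problem):
# Hypothetical toy version: 3 ids, 2 parents, 2 measurements
toy <- data.frame(m1 = c(1, 3, 5), m2 = c(2, 4, 6),
                  id = c("A", "B", "C"),
                  parent.id = c("a", "a", "b"),
                  stringsAsFactors = FALSE)
# ids A and B share parent "a", so for id "A":
#   m1 fraction = 1 / (1 + 3) = 0.25,  m2 fraction = 2 / (2 + 4) = 1/3
# id "C" is alone under parent "b", so both of its fractions are 1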
What I'm trying is:
sum.df <- dplyr::select(df,-id) %>% dplyr::group_by(parent.id) %>% dplyr::summarise_all(sum)
fraction.df <- do.call(rbind,lapply(df$id,function(i){
  pid <- dplyr::filter(df,id == i)$parent.id
  (dplyr::filter(df,id == i) %>% dplyr::select(-id,-parent.id)) /
    (dplyr::filter(sum.df,parent.id == pid) %>% dplyr::select(-parent.id))
}))
But at the real dimensions of my data (length(df$id) = 10,000 ids with 1,024 measurements each), this is not fast enough.
Any idea how to improve this, ideally using dplyr functions?
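For reference, data of roughly the real dimensions can be simulated like this (a sketch; n_ids and n_meas are illustrative names, not from my actual code):
# Sketch: simulate data at roughly the real dimensions (10,000 ids x 1,024 measurements)
set.seed(1)
n_ids  <- 10000
n_meas <- 1024
big.df <- cbind(matrix(rnorm(n_ids * n_meas), n_ids, n_meas),
                data.frame(id = sprintf("id%05d", seq_len(n_ids)),
                           parent.id = sample(letters[1:5], n_ids, replace = TRUE),
                           stringsAsFactors = FALSE))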
Let's compare these options with microbenchmark, all using the new definition of the dataset from @Sathish's answer:
OP method:
Units: seconds
      min      lq     mean   median       uq      max neval
 1.423583 1.48449 1.602001 1.581978 1.670041 2.275105   100
@Sathish's method speeds it up by a factor of about 5. This is valuable, to be sure.
Units: milliseconds
      min      lq     mean   median       uq      max neval
 299.3581 334.787 388.5283 363.0363 398.6714 951.4654   100
One possible base R implementation, below, using principles of efficient R code, improves things by a factor of about 65 (24 milliseconds vs. 1,582 milliseconds):
Units: milliseconds
      min       lq     mean   median       uq      max neval
 21.49046 22.59205 24.97197 23.81264 26.36277 34.72929   100
Here's the base R implementation. As is the case for the OP's implementation, the parent.id and id columns are not included in the resulting structure (here fractions). fractions is a matrix with rows ordered according to sort(interaction(df$id, df$parent.id, drop = TRUE)).
values <- df[1:100]                     # the 100 measurement columns only
parents <- split(values, df$parent.id)  # one data.frame of rows per parent.id
# 100 x n_parents matrix of per-parent column sums
sums <- vapply(parents, colSums, numeric(100), USE.NAMES = FALSE)
fractions <- matrix(0, 26, 100)         # pre-allocate the result
f_count <- 0                            # rows of `fractions` filled so far
for (p_count in seq_along(parents)){
  parent <- as.matrix(parents[[p_count]])
  dimnames(parent) <- NULL
  n <- nrow(parent)
  for (p_row in seq_len(n)){
    # divide each subject's row by its parent's column sums
    fractions[(f_count + p_row), ] <- parent[p_row, ] / sums[, p_count]
  }
  f_count <- f_count + n
}
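If you want the id/parent.id labels back on the rows, here is a small sketch based on that ordering:
# Sketch: re-attach labels, relying on the sort(interaction(...)) ordering described above
rownames(fractions) <- as.character(sort(interaction(df$id, df$parent.id, drop = TRUE)))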
Note: there's still room for improvement. split() is not particularly efficient.
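For example, here is a split()-free sketch of the same computation using rowsum() (my suggestion of one possible improvement, not benchmarked; note the rows stay in df's original order rather than the sorted order above):
# Sketch: split()-free variant via rowsum(); assumes df is the original data.frame
m <- as.matrix(df[1:100])
group_sums <- rowsum(m, df$parent.id)         # column sums per parent.id, one row per parent
fractions2 <- m / group_sums[df$parent.id, ]  # each row divided by its parent's sums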
Note 2: What "principles of efficient R code" were used?
Pre-allocating the result structure, working with matrices rather than data.frames inside the loop, computing the group sums once up front, and preferring vapply to the other apply family functions.

The problem with your data is that all rows are duplicates of each other, so I changed it slightly to reflect different values in the dataset.
Data:
set.seed(1L)
df <- cbind(matrix(rnorm(2600), nrow = 26, ncol = 100),data.frame(id=LETTERS,parent.id=sample(letters[1:5],26,replace = T),stringsAsFactors = F))
Code:
library('data.table')
setDT(df) # assign data.table class by reference
# compute sum for each `parent.id` for each column (100 columns)
sum_df <- df[, .SD, .SDcols = which(colnames(df) != 'id' )][, lapply(.SD, sum ), by = .(parent.id ) ]
# get column names for sum_df and df which are sorted for consistency
no_pid_id_df <- gtools::mixedsort( colnames(df)[ ! ( colnames(df) %in% c( 'id', 'parent.id' ) ) ] )
no_pid_sum_df <- gtools::mixedsort( colnames(sum_df)[ colnames(sum_df) != 'parent.id' ] )
# match the `parent.id` for each `id` and then divide its value by the value of `sum_df`.
df[, .( props = {
  pid <- parent.id
  unlist( .SD[, .SD, .SDcols = no_pid_id_df ] ) /
    unlist( sum_df[ parent.id == pid, ][, .SD, .SDcols = no_pid_sum_df ] )
}, parent.id ), by = .(id)]
Output:
# id props parent.id
# 1: A -0.95157186 e
# 2: A 0.06105359 e
# 3: A -0.42267771 e
# 4: A -0.03376174 e
# 5: A -0.16639600 e
# ---
# 2596: Z 2.34696158 e
# 2597: Z 0.23762369 e
# 2598: Z 0.60068440 e
# 2599: Z 0.14192337 e
# 2600: Z 0.01292592 e
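If you need the result back in the original wide layout, here is a sketch (assuming the data.table above was stored as res; props come out in column order within each id):
# Sketch: reshape the long result to a wide id x measurement matrix
wide <- matrix(res$props, nrow = uniqueN(res$id), byrow = TRUE,
               dimnames = list(unique(res$id), NULL))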
Benchmark:
library('microbenchmark')
microbenchmark( sathish(), frank(), dan())
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# sathish() 404.450219 413.456675 433.656279 420.46044 429.876085 593.44202 100 c
# frank() 2.035302 2.304547 2.707019 2.47257 2.622025 18.31409 100 a
# dan() 17.396981 18.230982 19.316653 18.59737 19.700394 27.13146 100 b
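The wrapper functions themselves aren't shown in the post; a sketch of how they might look, each simply wrapping the code from its answer (frank() refers to another answer not reproduced here):
# Sketch (assumption): one of the benchmarked wrappers, wrapping the data.table code above
sathish <- function() {
  df[, .( props = {
    pid <- parent.id
    unlist( .SD[, .SD, .SDcols = no_pid_id_df ] ) /
      unlist( sum_df[ parent.id == pid, ][, .SD, .SDcols = no_pid_sum_df ] )
  }, parent.id ), by = .(id)]
}
# dan() would wrap the base R loop from the earlier answer the same way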