 

Need help speeding up a dplyr aggregation

Tags:

r

dplyr

tl;dr: I have an aggregation problem that I haven't seen in the documentation before. I managed to get it working, but it is way too slow for the intended application. The data I usually work with have around 500 rows (my gut feeling tells me this isn't much for dplyr), and according to system.time the aggregation runs for about 4 s. My dilemma is that I want to run it repeatedly inside an optimisation, and currently I am looking at hours of run time.

Do you see anything where I can shave off some time?

If need be I can also send some data I work with.

Algorithm

I have a data set:

sample_dataset <- data_frame( starts = c(1000, 1008, 1017, 2000, 2020, 3000),
                          ends   = c(1009, 1015, 1020, 2015, 2030, 3010),
                          v = list(rep(1,10), rep(2,8),rep(3,4), 
                                   rep(4,16), rep(5,11), rep(6,11)) )

so each line encodes a signal together with its start and end index. I want to aggregate all lines that are less than closeness (e.g. 10) apart into a single line. In case it matters, starts is ordered.

The output should be:

structure(list(inds = 1:3, starts = c(1000, 2000, 3000), ends = c(1020,
2030, 3010), v = list(c(1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 2, 2, 2,
2, 2, 2, 0, 3, 3, 3, 3), c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), c(6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .Names = c("inds", "starts", "ends",
"v"))

So the first three lines of the original data set are aggregated, lines 4 and 5 are aggregated, and line 6 is unchanged. Where blocks overlap the values should be added; gaps should be filled with zeros. The updated starts value is the first starts, and the updated ends should be the last ends (I suppose I should fix it to use the max). But by the way these data are generated, ends should also be sorted. The case that one block is completely contained within another block does not occur.
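
To make the merge rule concrete, here is a small base-R illustration (mine, not part of the original data or code) of how the first two rows of sample_dataset combine: values are summed where the two blocks overlap.

# hypothetical illustration: row 1 covers 1000-1009 (ten 1s), row 2 covers 1008-1015 (eight 2s)
start <- 1000; end <- 1015
merged <- numeric(end - start + 1)
merged[(1000 - start + 1):(1009 - start + 1)] <-
  merged[(1000 - start + 1):(1009 - start + 1)] + rep(1, 10)
merged[(1008 - start + 1):(1015 - start + 1)] <-
  merged[(1008 - start + 1):(1015 - start + 1)] + rep(2, 8)
merged
# [1] 1 1 1 1 1 1 1 1 3 3 2 2 2 2 2 2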

I achieved this with the following code:

Code

library(dplyr)

join_lines <- function(dfi) {
  # single-row groups need no merging
  if (nrow(dfi) == 1) return(select(dfi, starts, ends, v))
  else 
    with(dfi, { 
             start <- starts[[1]]              # first start of the group
             end <- ends[[length(ends)]]       # last end of the group
             vals <- numeric(end - start + 1)  # zero-filled vector spanning the group
             add_val <- function(ddf)
               with(ddf, { 
                      # add this row's v into its slice of vals (overlaps are summed)
                      vals[(starts - start + 1) : (ends - start + 1)] <<- 
                        vals[(starts - start + 1) : (ends - start + 1)] + v })
             dfi %>% rowwise() %>% do(tmp = add_val(.))  # apply add_val to every row
             data_frame(starts = start, ends = end, v = list(vals))})
}

simplify_semisparse <- function(aframe, closeness = 10){
  aframe %>% 
    mutate( # a row is joined to the previous one if its start is within `closeness`
           join_pre = lag(ends, default = 0) + closeness >= starts,
           inds = cumsum(!join_pre)  # group index advances whenever a new block starts
           ) %>%
  group_by(inds) %>% do(join_lines(.)) %>% ungroup()
}

res <- simplify_semisparse(sample_dataset)

dput(res) # see above

Background

The data I am dealing with come from mass spectrometry. They are quite peculiar: a vector has around 500,000 entries, less than 10% of which are non-zero, and a typical spectrum has around 500 such dense blocks. I need to quickly interpolate values along such a spectrum - my idea was to use approx in the "dense" regions.
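
As an aside, here is a minimal sketch of how that interpolation could look once the blocks are merged (my own illustration; interpolate_blocks and its arguments are made up and not part of the question): each dense block is interpolated separately with approx, and positions outside every block are treated as zero.

interpolate_blocks <- function(res, x_out) {
  # res: output of simplify_semisparse (starts, ends, list-column v)
  # x_out: positions to interpolate at
  y_out <- numeric(length(x_out))
  for (i in seq_len(nrow(res))) {
    s <- res$starts[i]; e <- res$ends[i]
    inside <- x_out >= s & x_out <= e
    if (any(inside))
      y_out[inside] <- approx(x = s:e, y = res$v[[i]], xout = x_out[inside])$y
  }
  y_out  # positions outside every block stay 0
}

interpolate_blocks(res, c(1004.5, 2025, 2500))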

Comparison of suggestions

I had the chance to compare your suggestions.

The results produced by @matt-jewett's solution did not agree with my intended output, so I excluded it.

@jeremycg's solution was closest to my original approach, but it also did not produce exactly the same results.

Most important for me is the runtime; I compared the solutions on production data. My original solution took 2.165 s, @tjeremy's suggestion took 0.532 s, and @uwe-block's took 0.012 s.

So wow - I need to learn data.table.

asked Jul 04 '17 by bdecaf


2 Answers

Here's how I would do it. Your use of a list in v is not best practice (in my opinion), so I've used tidyr to unnest into a longer data frame. I've also left out your 0s - you could add them back in with a left join or something on the index:

library(tidyr)
sample_dataset %>%
 mutate(grouper = cumsum(c(0, na.omit(starts - lag(starts)))>20), id = row_number()) %>% #add a 'grouping' based on your closeness (20 here) and an id for later
 unnest(v) %>% #unnest v into lines - each v now has a line
 group_by(id) %>% #group by line
 mutate(count = row_number()+starts) %>% #get a 'location' per line
 group_by(grouper, count) %>% #group by the 'location' and group
 summarise(starts = starts[1], ends = ends[n()], v = sum(v)) #sum the v

which gives:

Source: local data frame [58 x 5]
Groups: grouper [?]

   grouper count starts  ends     v
     <int> <dbl>  <dbl> <dbl> <dbl>
1        0  1001   1000  1009     1
2        0  1002   1000  1009     1
3        0  1003   1000  1009     1
4        0  1004   1000  1009     1
5        0  1005   1000  1009     1
6        0  1006   1000  1009     1
7        0  1007   1000  1009     1
8        0  1008   1000  1009     1
9        0  1009   1000  1015     3
10       0  1010   1000  1015     3
# ... with 48 more rows

Then, if you really want, you can fill the missing values with 0 (here, out refers to the output from the above):

filled = out %>% group_by(grouper) %>% do(data.frame(count = seq(from = .$starts[1], to = tail(.$ends,1))))

filled = filled %>% left_join(out, by = c('grouper', 'count'))
filled$v[is.na(filled$v)] = 0

Source: local data frame [63 x 5]
Groups: grouper [?]

   grouper count starts  ends     v
     <int> <dbl>  <dbl> <dbl> <dbl>
1        0  1000     NA    NA     0
2        0  1001   1000  1009     1
3        0  1002   1000  1009     1
4        0  1003   1000  1009     1
5        0  1004   1000  1009     1
6        0  1005   1000  1009     1
7        0  1006   1000  1009     1
8        0  1007   1000  1009     1
9        0  1008   1000  1009     1
10       0  1009   1000  1015     3
# ... with 53 more rows
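
If the nested structure from the question is needed, the long filled result could be collapsed back into one row per group; a rough sketch of one way to do that (my own addition, not part of the answer above):

# collapse `filled` back to one row per group with a list-column v
nested <- filled %>%
  arrange(grouper, count) %>%
  group_by(grouper) %>%
  summarise(starts = min(starts, na.rm = TRUE),
            ends   = max(ends, na.rm = TRUE),
            v      = list(v))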
answered by jeremycg


Although the OP has asked to speed up dplyr code, I would like to suggest a data.table solution for performance reasons. Furthermore, none of the other answers posted so far has fully addressed the OP's requirements, namely:

  • to keep the structure of sample_data with starts, ends and a list of v values,
  • to aggregate all lines which have less than closeness (e.g. 10) distance into a single line.

The code below tries to comply with all requirements:

library(data.table)   # CRAN version 1.10.4 used
# define threshold: closeness as defined by OP, max_gap used in code 
closeness <- 10L
max_gap <- closeness - 1L
# coerce to data.table, and key, i.e., sort by starts and ends
DT <- data.table(sample_dataset, key = c("starts", "ends"))
# compute gaps between ends and starts of next row
# identify rows which belong together: inds is advanced if the gap is greater than the threshold
DT[, gap := starts - shift(ends, fill = -Inf)][, inds := cumsum(gap > max_gap)][]
# close gaps but only within groups
DT0 <- DT[between(gap, 2L, max_gap), .(starts = starts - (gap - 1L), ends = starts - 1L, 
                                       v = Vectorize(rep.int)(0L, gap - 1L), gap, inds)]
# bind rowwise (union in SQL), setkey on result to maintain sort order, 
# remove column gap as no longer needed
DT2 <- setkey(rbind(DT, DT0), starts, ends)[, gap := NULL][]
# aggregate groupwise, pick min/max, combine lists
result <- DT2[, .(starts = min(starts), ends = max(ends), v = list(Reduce(c, v))), by = inds]
# alternative code: pick first/last
result <- DT2[, .(starts = first(starts), ends = last(ends), v = list(Reduce(c, v))), by = inds]
result

produces

   inds starts ends            v
1:    1   1000 1020 1,1,1,1,1,1,
2:    2   2000 2030 4,4,4,4,4,4,
3:    3   3000 3010 6,6,6,6,6,6,

with

result$v
[[1]]
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 0 3 3 3 3

[[2]]
 [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 0 0 0 0 5 5 5 5 5 5 5 5 5 5 5

[[3]]
 [1] 6 6 6 6 6 6 6 6 6 6 6

It can be verified that the number of elements in the v vectors is the same, except for the additional zeros added for the intra-group gaps:

# test that all v values are included
# original
sum(lengths(sample_dataset$v))
#[1] 60
# result with additional zeros removed
sum(sapply(result$v, function(x) sum(x > 0)))
#[1] 60

I haven't provided a benchmark because the sample data set is too small.
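
For readers who do want to benchmark, a larger synthetic data set could be generated along these lines (my own sketch; the block lengths and gap sizes are arbitrary, and simplify_semisparse is the function from the question):

set.seed(1)
n_blocks <- 500
lens <- sample(5:30, n_blocks, replace = TRUE)   # block lengths
gaps <- sample(2:50, n_blocks, replace = TRUE)   # gaps before each block
starts <- cumsum(gaps + c(0, head(lens, -1)))
big_dataset <- dplyr::data_frame(
  starts = starts,
  ends   = starts + lens - 1,
  v      = lapply(lens, function(l) runif(l))
)
system.time(simplify_semisparse(big_dataset))    # dplyr version from the question
# the data.table pipeline above could be wrapped in a function and timed the same way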

Data

sample_dataset <- dplyr::data_frame( starts = c(1000, 1008, 1017, 2000, 2020, 3000),
                                     ends   = c(1009, 1015, 1020, 2015, 2030, 3010),
                                     v = list(rep(1,10), rep(2,8),rep(3,4), 
                                              rep(4,16), rep(5,11), rep(6,11)) )
answered by Uwe