tl;dr: I have an aggregation problem that I haven't seen in the documentation before. I managed to get it done, but it is way too slow for the intended application. The data I usually work with have around 500 lines (my gut feeling tells me this isn't much for dplyr), and according to system.time
it runs for about 4 s. My dilemma is that I want to run it repeatedly inside an optimisation, and currently I am looking at hours of run time.
Do you see anything where I can shave off some time?
If need be I can also send some data I work with.
Algorithm
I have a data set:
sample_dataset <- data_frame(
  starts = c(1000, 1008, 1017, 2000, 2020, 3000),
  ends   = c(1009, 1015, 1020, 2015, 2030, 3010),
  v      = list(rep(1, 10), rep(2, 8), rep(3, 4),
                rep(4, 16), rep(5, 11), rep(6, 11))
)
so each line encodes a signal together with a start and end index. I want to aggregate all lines that are less than closeness
(e.g. 10) apart into a single line. In case it matters, starts
is ordered.
The output should be:
structure(list(inds = 1:3, starts = c(1000, 2000, 3000), ends = c(1020,
2030, 3010), v = list(c(1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 2, 2, 2,
2, 2, 2, 0, 3, 3, 3, 3), c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), c(6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .Names = c("inds", "starts", "ends",
"v"))
So the first three lines in the original data set are aggregated, lines 4 and 5 are aggregated, and line 6 is unchanged. For overlaps the values should be added, and gaps should be filled with zeros. The updated starts value is the first starts, and the updated ends should be the last ends (I suppose I should fix it to the max, but by the way these are generated, ends should also be sorted). The case that one block is completely surrounded by another block should not occur.
I achieved this with the following code:
Code
library(dplyr)
join_lines <- function(dfi) {
  # nothing to join if the group consists of a single line
  if (nrow(dfi) == 1) return(select(dfi, starts, ends, v))
  with(dfi, {
    start <- starts[[1]]
    end   <- ends[[length(ends)]]
    vals  <- numeric(end - start + 1)
    # add each line's v into the matching slice of vals (overlaps are summed)
    add_val <- function(ddf)
      with(ddf, {
        vals[(starts - start + 1):(ends - start + 1)] <<-
          vals[(starts - start + 1):(ends - start + 1)] + v
      })
    dfi %>% rowwise() %>% do(tmp = add_val(.))
    data_frame(starts = start, ends = end, v = list(vals))
  })
}

simplify_semisparse <- function(aframe, closeness = 10) {
  aframe %>%
    mutate(
      # TRUE if this line is close enough to the previous one to be joined
      join_pre = lag(ends, default = 0) + closeness >= starts,
      # group index: advance whenever a line starts a new block
      inds = cumsum(!join_pre)
    ) %>%
    group_by(inds) %>%
    do(join_lines(.)) %>%
    ungroup()
}
res <- simplify_semisparse(sample_dataset)
dput(res) # see above
Background
The data I am dealing with is from mass spectrometry. It is peculiar in that a vector has around 500,000 entries and less than 10% of these are non-zero; a typical spectrum has around 500 such dense blocks. I need to quickly interpolate values along such a spectrum - my idea was to use approx
in the "dense" regions.
Comparison of suggestions
I had the chance to compare your suggestions.
The results produced by @matt-jewett's solution did not agree with my intended ones, so I excluded it.
@jeremycgs' solution was closest to my original approach, but also did not produce exactly the same results.
Most important for me is the runtime; I am comparing using production data. My original solution took 2.165 s, @tjeremy's suggestion took 0.532 s, and @uwe-block's 0.012 s.
So wow - I need to learn data.table.
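For reference, this is roughly how such a comparison could be scripted; the function names below are hypothetical wrappers around the three answers, and production_data is a placeholder for my own data set (same structure as sample_dataset):

# simplify_dplyr(), simplify_tidyr() and simplify_dt() are hypothetical
# wrappers around the three approaches; `production_data` is a placeholder
library(microbenchmark)
microbenchmark(
  dplyr      = simplify_dplyr(production_data),
  tidyr      = simplify_tidyr(production_data),
  data.table = simplify_dt(production_data),
  times = 10L
)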
Here's how I would do it. Your use of a list in v is not best practice (in my opinion), so I've used tidyr
to unnest into a longer data frame. I've also left out your 0s - you could add them back in with a left join or something on the index:
library(tidyr)
sample_dataset %>%
  # add a 'grouping' based on your closeness (20 here) and an id for later
  mutate(grouper = cumsum(c(0, na.omit(starts - lag(starts))) > 20),
         id = row_number()) %>%
  unnest(v) %>%                              # unnest v - each element of v now has a line
  group_by(id) %>%                           # group by original line
  mutate(count = row_number() + starts) %>%  # get a 'location' for each element
  group_by(grouper, count) %>%               # group by the 'location' and group
  summarise(starts = starts[1], ends = ends[n()], v = sum(v))  # sum the v
which gives:
Source: local data frame [58 x 5]
Groups: grouper [?]
grouper count starts ends v
<int> <dbl> <dbl> <dbl> <dbl>
1 0 1001 1000 1009 1
2 0 1002 1000 1009 1
3 0 1003 1000 1009 1
4 0 1004 1000 1009 1
5 0 1005 1000 1009 1
6 0 1006 1000 1009 1
7 0 1007 1000 1009 1
8 0 1008 1000 1009 1
9 0 1009 1000 1015 3
10 0 1010 1000 1015 3
# ... with 48 more rows
Then, if you really want, you can fill the missing values with 0 (out
here is the output from the above):
filled <- out %>%
  group_by(grouper) %>%
  do(data.frame(count = seq(from = .$starts[1], to = tail(.$ends, 1))))
filled <- filled %>% left_join(out, by = c('grouper', 'count'))
filled$v[is.na(filled$v)] <- 0
Source: local data frame [63 x 5]
Groups: grouper [?]
grouper count starts ends v
<int> <dbl> <dbl> <dbl> <dbl>
1 0 1000 NA NA 0
2 0 1001 1000 1009 1
3 0 1002 1000 1009 1
4 0 1003 1000 1009 1
5 0 1004 1000 1009 1
6 0 1005 1000 1009 1
7 0 1006 1000 1009 1
8 0 1007 1000 1009 1
9 0 1008 1000 1009 1
10 0 1009 1000 1015 3
# ... with 53 more rows
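If the original one-row-per-block shape with a list column is needed again, the filled frame can be collapsed back, along these lines (a sketch building on filled above, not part of the original answer):

# a sketch: collapse `filled` back into one row per block,
# with the summed/zero-filled signal as a list column
filled %>%
  group_by(grouper) %>%
  summarise(starts = min(count), ends = max(count), v = list(v))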
Although the OP has asked to speed up dplyr
code, I would like to suggest a data.table
solution for performance reasons. Furthermore, none of the other answers posted so far has fully addressed the OP's requirements, namely to return a result with the same structure as sample_dataset
, i.e., with starts
, ends
and a list of v
values. The code below tries to comply with all requirements:
library(data.table)   # CRAN version 1.10.4 used

# define threshold: closeness as defined by OP, max_gap used in code
closeness <- 10L
max_gap <- closeness - 1L

# coerce to data.table and key, i.e., sort by starts and ends
DT <- data.table(sample_dataset, key = c("starts", "ends"))

# compute gaps between ends and starts of the next row;
# identify rows which belong together: inds is advanced if gap is greater than the threshold
DT[, gap := starts - shift(ends, fill = -Inf)][, inds := cumsum(gap > max_gap)][]

# close gaps, but only within groups
DT0 <- DT[between(gap, 2L, max_gap), .(starts = starts - (gap - 1L), ends = starts - 1L,
                                       v = Vectorize(rep.int)(0L, gap - 1L), gap, inds)]

# bind rowwise (union in SQL), setkey on result to maintain sort order,
# remove column gap as no longer needed
DT2 <- setkey(rbind(DT, DT0), starts, ends)[, gap := NULL][]

# aggregate groupwise, pick min/max, combine lists
result <- DT2[, .(starts = min(starts), ends = max(ends), v = list(Reduce(c, v))), by = inds]

# alternative: pick first/last instead of min/max
result <- DT2[, .(starts = first(starts), ends = last(ends), v = list(Reduce(c, v))), by = inds]

result
produces
   inds starts ends            v
1:    1   1000 1020 1,1,1,1,1,1,
2:    2   2000 2030 4,4,4,4,4,4,
3:    3   3000 3010 6,6,6,6,6,6,
with
result$v
[[1]]
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 0 3 3 3 3

[[2]]
 [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 0 0 0 0 5 5 5 5 5 5 5 5 5 5 5

[[3]]
 [1] 6 6 6 6 6 6 6 6 6 6 6
It can be verified that the number of elements in the v
vectors is the same, except for the additional zeros added for the intra-group gaps:
# test that all v values are included
# original
sum(lengths(sample_dataset$v))
#[1] 60
# result with additional zeros removed
sum(sapply(result$v, function(x) sum(x > 0)))
#[1] 60
I haven't provided a benchmark because the sample data set is too small.
sample_dataset <- dplyr::data_frame(
  starts = c(1000, 1008, 1017, 2000, 2020, 3000),
  ends   = c(1009, 1015, 1020, 2015, 2030, 3010),
  v      = list(rep(1, 10), rep(2, 8), rep(3, 4),
                rep(4, 16), rep(5, 11), rep(6, 11))
)
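If a benchmark is wanted, a larger input in the same format could be generated along these lines (a sketch with made-up block lengths and gaps, roughly matching the ~500 dense blocks the OP describes):

# a sketch, not part of the original answer: build ~500 synthetic blocks
# with made-up lengths and gaps, in the same format as sample_dataset
set.seed(1)
n_blocks  <- 500L
block_len <- sample(5:50, n_blocks, replace = TRUE)
block_gap <- sample(1:30, n_blocks, replace = TRUE)
starts    <- cumsum(c(1000L, head(block_len + block_gap, -1L)))
big_dataset <- dplyr::data_frame(
  starts = starts,
  ends   = starts + block_len - 1L,
  v      = lapply(seq_len(n_blocks), function(i) rep(i, block_len[i]))
)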