I have a data table that defines the start and end coordinates for a set of sequences. For example:
df1 <- data.frame(from = c(7, 22, 35, 21, 50),
                  to   = c(13, 29, 43, 31, 60))
Given overall start and end coordinates (i.e. 1 and 100), I am trying to identify all integers not covered by the sequences, in the same output format. For example:
df2 <- data.frame(from = c(1, 14, 32, 44, 61),
                  to   = c(6, 20, 34, 49, 100))
Here is my current attempt, in which I expand the sequences in df1 into integer vectors and then identify all integers in 1:100 that are not covered:
# vectorised seq() so each from/to pair expands into its full integer range
seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
all_int <- 1:100
df1_int <- unlist(seq2(from = df1$from, to = df1$to))
df1_int <- unique(df1_int)
# integers in 1:100 not covered by any sequence in df1
df2_int <- all_int[!all_int %in% df1_int]
# TRUE only if the uncovered integers form a single consecutive run
all(diff(df2_int) == 1)
However, this method is too slow for the dataset I want to apply it to (~100,000,000 integers), and I do not know how to reformat the vector df2_int into a data frame in the format of df2.
Any help will be greatly appreciated!
NB: The sequences in df1 do not always start with the lowest integer (e.g. a sequence could run from 13 to 7, as opposed to from 7 to 13). There could also be sequences with only one integer (e.g. from 7 to 7).
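A minimal base R sketch for that normalisation (assuming df1 as defined above): pmin/pmax swap the bounds element-wise, transform() evaluates both expressions against the original columns so the swap is safe, and single-integer ranges are left untouched.

# ensure from <= to in every row
df1 <- transform(df1, from = pmin(from, to), to = pmax(from, to))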
Since you need a fast solution, we could attempt a base R approach using setdiff and split. The vectorization we leave to mapply. To find the factors where to split, we use findInterval. To get the start and end points of each element of the resulting list, we finish with range.
# d: all integers in 1:100 not covered by any range in df1
d <- setdiff(1:100, unlist(mapply(seq.default, df1[, 1], df1[, 2])))
# split d into runs of consecutive integers, then take each run's range
t(sapply(split(d, findInterval(d, d[which(c(1, diff(d)) > 1)])), range))
#   [,1] [,2]
# 0    1    6
# 1   14   20
# 2   32   34
# 3   44   49
# 4   61  100
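The result is a matrix rather than a data frame; if the output must match df2's layout exactly, a small finishing step (an assumption, since the answer stops at the matrix) converts it:

# rebuild the from/to data frame from the matrix of run ranges
m <- t(sapply(split(d, findInterval(d, d[which(c(1, diff(d)) > 1)])), range))
df2 <- setNames(as.data.frame(m), c("from", "to"))
rownames(df2) <- NULL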
Benchmark
As we can see from the benchmark, we have achieved a pretty fast solution.
Unit: microseconds
         expr      min        lq      mean    median       uq      max neval cld
        purrr 1575.479 1593.2110 1634.3573 1604.9475 1634.033 2028.095   100   b
 findInterval  250.801  256.9245  276.8609  273.3815  281.673  498.285   100   a
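The code that produced these timings is not shown; a sketch of how such a comparison is typically set up with the microbenchmark package (f_purrr and f_findInterval are hypothetical wrappers around the two solutions):

library(microbenchmark)

# hypothetical wrappers around the two solutions; times = 100 matches neval
microbenchmark(
  purrr        = f_purrr(df1),
  findInterval = f_findInterval(df1),
  times = 100
)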
Edit: Should have read the question better. This is basically your current approach.
You can pmap over your input with the seq function, and unlist that to get a vector of all covered values. Then use setdiff to get the missing values. Using diff and cumsum you can create a grouping variable for the missing values, grouping them into from-to pairs. Then split the missing-value vector by the grouping variable and map over that to create one row of output for each group.
library(purrr)

# all values in 1:100 that are not covered by any row of df1
miss <- setdiff(1:100, unlist(pmap(df1, seq)))

# grouping variable: count run boundaries from the right, so each run of
# consecutive missing values gets its own (descending) id
i <-
  miss %>%
  diff %>%
  `>`(1) %>%
  rev %>%
  cumsum %>%
  rev

# one output row per run: its first and last element
map_df(split(miss, c(i, 0)), ~list(from = head(.x, 1), to = tail(.x, 1))) %>%
  dplyr::arrange(from)
# # A tibble: 5 x 2
#    from    to
#   <int> <int>
# 1     1     6
# 2    14    20
# 3    32    34
# 4    44    49
# 5    61   100
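As a side note, an ascending grouping index (a sketch, assumed equivalent) avoids the double rev, and since split() keeps the runs in order, the final arrange() becomes unnecessary:

# a new run starts wherever the gap to the previous missing value exceeds 1
g <- cumsum(c(TRUE, diff(miss) > 1))
map_df(split(miss, g), ~list(from = head(.x, 1), to = tail(.x, 1)))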
Borrowing an idea from How to flatten / merge overlapping time periods, but in a data.table approach instead:
library(data.table)
setDT(df1)
setorder(df1, from, to)
maxn <- 100L  # overall end coordinate

# see linked post: g groups rows whose ranges overlap into one block
df1[, g := c(0, cumsum(shift(from, -1L) > cummax(to))[-.N])]

# each block contributes the start of the gap after it (max(to) + 1) and the
# end of the gap before it (min(from) - 1); padding with the overall bounds
# 1 and maxn pairs these up into the desired from/to rows
df1[, .(from = max(to) + 1L, to = min(from) - 1L), by = .(g)][,
  .(from = c(1L, from), to = c(to, maxn))]
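One caveat, following the NB in the question: setorder() assumes from <= to, so reversed ranges (e.g. 13 to 7) would need swapping first, as in the normalisation sketched under the question; in data.table syntax that would be:

# swap reversed ranges in place before sorting; both right-hand sides are
# evaluated against the original columns, so the swap is safe
df1[, c("from", "to") := .(pmin(from, to), pmax(from, to))]

For the real dataset, maxn would simply be set to the actual endpoint (e.g. maxn <- 100000000L).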
Hopefully, this is fast enough for your actual dataset with 100 million integers.