I have a data table that defines the start and end coordinates for a set of sequences. For example:
df1 <- data.frame(from = c(7, 22, 35, 21, 50),
                  to   = c(13, 29, 43, 31, 60))
Given overall start and end coordinates (i.e. 1 and 100), I am trying to identify all integers not covered by the sequences, in the same output format. For example:
df2 <- data.frame(from = c(1, 14, 32, 44, 61),
                  to   = c(6, 20, 34, 49, 100))
Here is my current attempt, in which I expand the sequences in df1 into integer vectors and then identify all integers in 1:100 that are not covered:
# vectorised seq() so each from/to pair expands into its full integer range
seq2 <- Vectorize(seq.default, vectorize.args = c("from", "to"))
all_int <- 1:100
df1_int <- unlist(seq2(from = df1$from, to = df1$to))
df1_int <- unique(df1_int)
# integers in 1:100 not covered by any sequence in df1
df2_int <- all_int[!all_int %in% df1_int]
# TRUE only if the uncovered integers form a single consecutive run
all(diff(df2_int) == 1)
However, this method is too slow for the dataset I want to apply it to (~100,000,000 integers), and I do not know how to reformat the vector df2_int into a data frame in the format of df2.
Any help will be greatly appreciated!
NB: The sequences in df1 do not always start with the lowest integer (e.g. a sequence could run from 13 to 7, as opposed to from 7 to 13). There could also be sequences with only one integer (e.g. from 7 to 7).
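A minimal base R sketch for that normalisation (assuming df1 as defined above): pmin/pmax swap the bounds element-wise, transform() evaluates both expressions against the original columns so the swap is safe, and single-integer ranges are left untouched.

# ensure from <= to in every row
df1 <- transform(df1, from = pmin(from, to), to = pmax(from, to))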
Since you need a fast solution, we could attempt a base R approach using setdiff and split. The vectorization we leave to mapply. To find the factors where to split, we use findInterval. To get the start and end points of each element of the resulting list, we finish with range.
# d: all integers in 1:100 not covered by any range in df1
d <- setdiff(1:100, unlist(mapply(seq.default, df1[, 1], df1[, 2])))
# split d into runs of consecutive integers, then take each run's range
t(sapply(split(d, findInterval(d, d[which(c(1, diff(d)) > 1)])), range))
#   [,1] [,2]
# 0    1    6
# 1   14   20
# 2   32   34
# 3   44   49
# 4   61  100
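The result is a matrix rather than a data frame; if the output must match df2's layout exactly, a small finishing step (an assumption, since the answer stops at the matrix) converts it:

# rebuild the from/to data frame from the matrix of run ranges
m <- t(sapply(split(d, findInterval(d, d[which(c(1, diff(d)) > 1)])), range))
df2 <- setNames(as.data.frame(m), c("from", "to"))
rownames(df2) <- NULL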
Benchmark
As we can see from the benchmark, we have achieved a pretty fast solution.
Unit: microseconds
         expr      min        lq      mean    median       uq      max neval cld
        purrr 1575.479 1593.2110 1634.3573 1604.9475 1634.033 2028.095   100   b
 findInterval  250.801  256.9245  276.8609  273.3815  281.673  498.285   100   a
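The code that produced these timings is not shown; a sketch of how such a comparison is typically set up with the microbenchmark package (f_purrr and f_findInterval are hypothetical wrappers around the two solutions):

library(microbenchmark)

# hypothetical wrappers around the two solutions; times = 100 matches neval
microbenchmark(
  purrr        = f_purrr(df1),
  findInterval = f_findInterval(df1),
  times = 100
)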
Edit: Should have read the question better. This is basically your current approach.
You can pmap over your input with the seq function, and unlist that to get a vector of all covered values. Then use setdiff to get the missing values. Using diff and cumsum you can create a grouping variable for the missing values, grouping them into from-to pairs. Then split the missing-value vector by the grouping variable and map over that to create one row of output for each group.
library(purrr)

# all values in 1:100 that are not covered by any row of df1
miss <- setdiff(1:100, unlist(pmap(df1, seq)))

# grouping variable: count run boundaries from the right, so each run of
# consecutive missing values gets its own (descending) id
i <-
  miss %>%
  diff %>%
  `>`(1) %>%
  rev %>%
  cumsum %>%
  rev

# one output row per run: its first and last element
map_df(split(miss, c(i, 0)), ~list(from = head(.x, 1), to = tail(.x, 1))) %>%
  dplyr::arrange(from)
# # A tibble: 5 x 2
#    from    to
#   <int> <int>
# 1     1     6
# 2    14    20
# 3    32    34
# 4    44    49
# 5    61   100
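As a side note, an ascending grouping index (a sketch, assumed equivalent) avoids the double rev, and since split() keeps the runs in order, the final arrange() becomes unnecessary:

# a new run starts wherever the gap to the previous missing value exceeds 1
g <- cumsum(c(TRUE, diff(miss) > 1))
map_df(split(miss, g), ~list(from = head(.x, 1), to = tail(.x, 1)))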
Borrowing an idea from How to flatten / merge overlapping time periods, but in a data.table approach instead:
library(data.table)
setDT(df1)
setorder(df1, from, to)
maxn <- 100L  # overall end coordinate

# see linked post: g groups rows whose ranges overlap into one block
df1[, g := c(0, cumsum(shift(from, -1L) > cummax(to))[-.N])]

# each block contributes the start of the gap after it (max(to) + 1) and the
# end of the gap before it (min(from) - 1); padding with the overall bounds
# 1 and maxn pairs these up into the desired from/to rows
df1[, .(from = max(to) + 1L, to = min(from) - 1L), by = .(g)][,
  .(from = c(1L, from), to = c(to, maxn))]
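One caveat, following the NB in the question: setorder() assumes from <= to, so reversed ranges (e.g. 13 to 7) would need swapping first, as in the normalisation sketched under the question; in data.table syntax that would be:

# swap reversed ranges in place before sorting; both right-hand sides are
# evaluated against the original columns, so the swap is safe
df1[, c("from", "to") := .(pmin(from, to), pmax(from, to))]

For the real dataset, maxn would simply be set to the actual endpoint (e.g. maxn <- 100000000L).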
Hopefully, this is fast enough for your actual dataset with 100 million integers.