Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter rows based on combined set of values in a string

In R, I have the following dataframe with the column "overlap" listing rows that have overlapping values on some other column.

df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", 
                              "5,6", "5,6,7", "6,7", 
                              "8,9", "8,9,10", "9,10", 
                              "11,12,13", "11,12,13", 
                              "11,12,13,14", "13,14", 
                              "15,16", "15,16,17", "16,17", 
                              "18,19", "18,19,20", "19,20"))

df
         overlap
  1        1,2,3
  2        1,2,3
  3      1,2,3,4
  4          3,4
  5          5,6
  6        5,6,7
  7          6,7
  8          8,9
  9       8,9,10
  10        9,10
  11    11,12,13
  12    11,12,13
  13 11,12,13,14
  14       13,14
  15       15,16
  16    15,16,17
  17       16,17
  18       18,19
  19    18,19,20
  20       19,20

I would like to identify rows with common values, even if those values are not in all rows, and then keep only 1 of the rows. For example, rows 1-4 contain the combined set 1,2,3,4 and I would like to keep only one of these rows. If we keep the first row, the resulting df would be:

  1        1,2,3
  5          5,6
  8          8,9
  11    11,12,13
  15       15,16
  18       18,19

I've searched many other solutions on here and none include uneven rows lengths, and which is vital as the full data can have rows with dozens of values.

like image 546
bcrew Avatar asked Oct 16 '25 04:10

bcrew


2 Answers

One option for this particular example data is to create an igraph graph from row overlaps, detect connected components in resulting graph and use component's cluster id as a grouping variable. From there we can pick the first row from every group.

library(dplyr)
library(igraph)

df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", 
                              "5,6", "5,6,7", "6,7", 
                              "8,9", "8,9,10", "9,10", 
                              "11,12,13", "11,12,13", 
                              "11,12,13,14", "13,14", 
                              "15,16", "15,16,17", "16,17", 
                              "18,19", "18,19,20", "19,20"))


df |> 
  mutate(id = row_number(), .before = 1) |> 
  group_by(
    g_clust = 
      strsplit(overlap, ",") |> 
      # either create a directed graph or set duplicate = FALSE for 
      # corner cases like `overlap = c("1", "1,2,3", ...)`
      graph_from_adj_list(mode = "all", duplicate = FALSE) |> 
      components() |> 
      getElement("membership")
    ) |> 
  slice_head(n = 1)
#> # A tibble: 6 × 3
#> # Groups:   g_clust [6]
#>      id overlap  g_clust
#>   <int> <chr>      <dbl>
#> 1     1 1,2,3          1
#> 2     5 5,6            2
#> 3     8 8,9            3
#> 4    11 11,12,13       4
#> 5    15 15,16          5
#> 6    18 18,19          6

Overlaps graph for reference:

strsplit(df$overlap, ",") |> 
  graph_from_adj_list(mode = "all", duplicate = FALSE) |>
  plot()

graph

like image 173
margusl Avatar answered Oct 17 '25 18:10

margusl


We can try {ivs}:

x = vapply(strsplit(unique(overlap), ","), 
           \(i) as.numeric(i[c(1, length(i))]), numeric(2))

library(ivs)
int = iv_groups(iv(x[1, ], x[2, ]))

giving

> as.data.frame(int)
         y
1   [1, 4)
2   [5, 7)
3  [8, 10)
4 [11, 14)
5 [15, 17)
6 [18, 20)

The vapply is a bit redundant as we call as.numeric several times. Do you really want comma-separated integers stored as character?

transform(as.data.frame(int), 
          s = Vectorize(\(x, y) toString(x:y))(iv_start(int), iv_end(int) - 1))

giving

       int          s
1   [1, 4)    1, 2, 3
2   [5, 7)       5, 6
3  [8, 10)       8, 9
4 [11, 14) 11, 12, 13
5 [15, 17)     15, 16
6 [18, 20)     18, 19

Edit

@Chris is right in the comment below. I should add some explanation.

(1) Re-structure the data. Split the strings, find first and last value, coerce character to numeric.

x = # we assign the output of the pipe x |> ... |> ... to x 
  overlap |> # access the data
  unique() |> # get rid of duplicates (not needed)
  strsplit(",") |> # split on ",", we might want to add fixed=TRUE
  # returns a list of character vectors, so we iterate over it with lapply
  lapply(\(x) x[c(1, length(x))]) |> # get first and last element
  # "1,2,3" ---> "1" "2" "3" has length 3 while "3" "4" has length 2
  do.call(what="rbind") |> # list to 2-column matrix 
  type.convert(as.is=TRUE) # we coerce from character to numeric 

gives

> x
      [,1] [,2]
 [1,]    1    3
 [2,]    1    4
 [3,]    3    4
 [4,]    5    6
 [5,]    5    7
 [6,]    6    7
 [7,]    8    9
 [8,]    8   10
 [9,]    9   10
[10,]   11   13
[11,]   11   14
[12,]   13   14
[13,]   15   16
[14,]   15   17
[15,]   16   17
[16,]   18   19
[17,]   18   20
[18,]   19   20
> 
> # of 
> class(x)
[1] "matrix" "array" 

This obviously assumes that the lowest integer is in first position and the highest in last--a reasonable assumption? Otherwise we should coerce to numeric first and apply range on each list element.

(2) To create interval vectors, we use iv(). From it's documentation (cp. help(iv)):

iv() creates an interval vector from start and end vectors. This is how you will typically create interval vectors, and is often used with columns in a data frame.

i.e.

> library(ivs)
> y = iv(x[, 1], x[, 2])
> y
<iv<integer>[18]>
 [1] [1, 3)   [1, 4)   [3, 4)   [5, 6)   [5, 7)   [6, 7)   [8, 9)   [8, 10) 
 [9] [9, 10)  [11, 13) [11, 14) [13, 14) [15, 16) [15, 17) [16, 17) [18, 19)
[17] [18, 20) [19, 20)

Finally, we use iv_groups. From help(iv_groups):

This family of functions revolves around grouping overlapping intervals within a single iv. When multiple overlapping intervals are grouped together they result in a wider interval containing the smallest iv_start() and the largest iv_end() of the overlaps.

> z = iv_groups(y)
> z
<iv<integer>[6]>
[1] [1, 4)   [5, 7)   [8, 10)  [11, 14) [15, 17) [18, 20)

(3) It seems like your desired output is a character vector, where all integer sequences of the remaining intervals are pasted together.

To achieve this, we use ivs::iv_start() and ivs::iv_end() to access the boundaries of each interval. Now we would like to generate regular sequences. Unfortunately, : is not vectorised, hence we introduce:

seq2str = Vectorize(\(from, to) toString(from:to))

toSpring() is a wrapper for format(), it's help page states

The default method first converts x to character and then concatenates the elements separated by ", ".

Applying our custom function to each start and end gives

> seq2str(from = iv_start(z), to = iv_end(z) - 1)
[1] "1, 2, 3"    "5, 6"       "8, 9"       "11, 12, 13" "15, 16"     "18, 19"      

Note

You can also use this shorter version.

# input data 
overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", "5,6", "5,6,7", "6,7", "8,9", 
            "8,9,10", "9,10", "11,12,13", "11,12,13", "11,12,13,14", "13,14",
            "15,16", "15,16,17", "16,17", "18,19", "18,19,20", "19,20")
# piped version
library(ivs)
overlap |> 
  strsplit(",") |> 
  lapply(\(x) x[c(1, length(x))]) |> 
  do.call(what="rbind") |>
  type.convert(as.is=TRUE) |>
  list(. = _) |>
  with(iv(.[, 1], .[, 2])) |>
  iv_groups() |>
  list(. = _) |>
  with(Vectorize(\(from, to) toString(from:to))(iv_start(.), to = iv_end(.) - 1))

where we use a trick to be able to make use of the forward pipe operator from base.

like image 28
Friede Avatar answered Oct 17 '25 17:10

Friede



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!