Filter rows based on combined set of values in a string

Question

In R, I have the following dataframe with the column "overlap" listing rows that have overlapping values on some other column.

df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", 
                              "5,6", "5,6,7", "6,7", 
                              "8,9", "8,9,10", "9,10", 
                              "11,12,13", "11,12,13", 
                              "11,12,13,14", "13,14", 
                              "15,16", "15,16,17", "16,17", 
                              "18,19", "18,19,20", "19,20"))

df
         overlap
  1        1,2,3
  2        1,2,3
  3      1,2,3,4
  4          3,4
  5          5,6
  6        5,6,7
  7          6,7
  8          8,9
  9       8,9,10
  10        9,10
  11    11,12,13
  12    11,12,13
  13 11,12,13,14
  14       13,14
  15       15,16
  16    15,16,17
  17       16,17
  18       18,19
  19    18,19,20
  20       19,20

I would like to identify rows with common values, even if those values are not in all rows, and then keep only 1 of the rows. For example, rows 1-4 contain the combined set 1,2,3,4 and I would like to keep only one of these rows. If we keep the first row, the resulting df would be:

  1        1,2,3
  5          5,6
  8          8,9
  11    11,12,13
  15       15,16
  18       18,19

I've searched many other solutions on here, but none handle rows of uneven length, which is vital because the full data can have rows with dozens of values.

margusl · Accepted Answer

One option for this particular example data is to create an igraph graph from the row overlaps, detect the connected components in the resulting graph, and use each component's cluster id as a grouping variable. From there we can pick the first row from every group.

library(dplyr)
library(igraph)

df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", 
                              "5,6", "5,6,7", "6,7", 
                              "8,9", "8,9,10", "9,10", 
                              "11,12,13", "11,12,13", 
                              "11,12,13,14", "13,14", 
                              "15,16", "15,16,17", "16,17", 
                              "18,19", "18,19,20", "19,20"))


df |> 
  mutate(id = row_number(), .before = 1) |> 
  group_by(
    g_clust = 
      strsplit(overlap, ",") |> 
      # either create a directed graph or set duplicate = FALSE for 
      # corner cases like `overlap = c("1", "1,2,3", ...)`
      graph_from_adj_list(mode = "all", duplicate = FALSE) |> 
      components() |> 
      getElement("membership")
    ) |> 
  slice_head(n = 1)
#> # A tibble: 6 × 3
#> # Groups:   g_clust [6]
#>      id overlap  g_clust
#>   <int> <chr>      <dbl>
#> 1     1 1,2,3          1
#> 2     5 5,6            2
#> 3     8 8,9            3
#> 4    11 11,12,13       4
#> 5    15 15,16          5
#> 6    18 18,19          6

Overlaps graph for reference:

strsplit(df$overlap, ",") |> 
  graph_from_adj_list(mode = "all", duplicate = FALSE) |>
  plot()

[Image: plot of the overlaps graph]
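If pulling in igraph is not an option, the same connected-components idea can be sketched in base R with a small union-find. This is only an illustration, not part of the answer above: `overlap_groups()` is a hypothetical helper, and it treats the values as opaque labels, grouping rows that share any label directly or through a chain of other rows.

```r
# Hypothetical helper: for every element of x, return the index of the
# first row in its group. Rows belong to the same group when they share a
# value directly or through a chain of other rows.
overlap_groups <- function(x, sep = ",") {
  values <- strsplit(x, sep, fixed = TRUE)
  parent <- seq_along(x)               # union-find forest over row indices

  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  first_seen <- new.env()              # value -> first row containing it
  for (i in seq_along(values)) {
    for (v in values[[i]]) {
      j <- first_seen[[v]]
      if (is.null(j)) {
        first_seen[[v]] <- i
      } else {
        ri <- find_root(i)
        rj <- find_root(j)
        # Link the larger root under the smaller one, so each group's root
        # is always its first row.
        if (ri != rj) parent[max(ri, rj)] <- min(ri, rj)
      }
    }
  }
  vapply(seq_along(x), find_root, integer(1))
}

# Toy subset of the question's data
df_small <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4",
                                   "5,6", "5,6,7", "6,7"))
g <- overlap_groups(df_small$overlap)
kept <- df_small[!duplicated(g), , drop = FALSE]   # first row of each group
kept
```

Because the group id is the index of the group's first row, `!duplicated(g)` keeps exactly that row. The sketch assumes the strings contain no missing values and no stray whitespace around the separators.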

Friede · Answer

We can try {ivs}:

x = vapply(strsplit(unique(overlap), ","), 
           \(i) as.numeric(i[c(1, length(i))]), numeric(2))

library(ivs)
int = iv_groups(iv(x[1, ], x[2, ]))

giving

> as.data.frame(int)
         y
1   [1, 4)
2   [5, 7)
3  [8, 10)
4 [11, 14)
5 [15, 17)
6 [18, 20)

The vapply is a bit redundant as we call as.numeric several times. Do you really want comma-separated integers stored as character?

transform(as.data.frame(int), 
          s = Vectorize(\(x, y) toString(x:y))(iv_start(int), iv_end(int) - 1))

giving

       int          s
1   [1, 4)    1, 2, 3
2   [5, 7)       5, 6
3  [8, 10)       8, 9
4 [11, 14) 11, 12, 13
5 [15, 17)     15, 16
6 [18, 20)     18, 19

Edit

@Chris is right in the comment below. I should add some explanation.

(1) Re-structure the data. Split the strings, find first and last value, coerce character to numeric.

x = # we assign the output of the pipe x |> ... |> ... to x 
  overlap |> # access the data
  unique() |> # get rid of duplicates (not needed)
  strsplit(",") |> # split on ",", we might want to add fixed=TRUE
  # returns a list of character vectors, so we iterate over it with lapply
  lapply(\(x) x[c(1, length(x))]) |> # get first and last element
  # "1,2,3" ---> "1" "2" "3" has length 3 while "3" "4" has length 2
  do.call(what="rbind") |> # list to 2-column matrix 
  type.convert(as.is=TRUE) # we coerce from character to numeric

gives

> x
      [,1] [,2]
 [1,]    1    3
 [2,]    1    4
 [3,]    3    4
 [4,]    5    6
 [5,]    5    7
 [6,]    6    7
 [7,]    8    9
 [8,]    8   10
 [9,]    9   10
[10,]   11   13
[11,]   11   14
[12,]   13   14
[13,]   15   16
[14,]   15   17
[15,]   16   17
[16,]   18   19
[17,]   18   20
[18,]   19   20
> # x is a matrix
> class(x)
[1] "matrix" "array"

This assumes that the lowest integer comes first and the highest comes last within each string. Is that a reasonable assumption? Otherwise we should coerce to numeric first and apply range() to each list element.
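A minimal sketch of that more defensive variant, using a deliberately unsorted toy vector (the name `overlap_unsorted` and its values are made up for illustration):

```r
# Toy input where the values inside each string are NOT sorted
overlap_unsorted <- c("3,1,2", "4,3", "6,5")

bounds <- overlap_unsorted |>
  strsplit(",", fixed = TRUE) |>
  lapply(\(v) range(as.numeric(v))) |>  # c(min, max) of each row
  do.call(what = "rbind")               # list -> 2-column numeric matrix
bounds
```

Here the first column holds each row's minimum and the second its maximum, regardless of the order in which the values were written.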

(2) To create interval vectors, we use iv(). From its documentation (cf. help(iv)):

iv() creates an interval vector from start and end vectors. This is how you will typically create interval vectors, and is often used with columns in a data frame.

i.e.

> library(ivs)
> y = iv(x[, 1], x[, 2])
> y
<iv<integer>[18]>
 [1] [1, 3)   [1, 4)   [3, 4)   [5, 6)   [5, 7)   [6, 7)   [8, 9)   [8, 10) 
 [9] [9, 10)  [11, 13) [11, 14) [13, 14) [15, 16) [15, 17) [16, 17) [18, 19)
[17] [18, 20) [19, 20)

Finally, we use iv_groups. From help(iv_groups):

This family of functions revolves around grouping overlapping intervals within a single iv. When multiple overlapping intervals are grouped together they result in a wider interval containing the smallest iv_start() and the largest iv_end() of the overlaps.

> z = iv_groups(y)
> z
<iv<integer>[6]>
[1] [1, 4)   [5, 7)   [8, 10)  [11, 14) [15, 17) [18, 20)
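For intuition about what the grouping does to this data, here is a base R sketch of the same idea: sort by start, track the largest end seen so far, and open a new group whenever a start lies beyond it. This is not how ivs implements it; `interval_groups()` is a hypothetical helper that treats intervals that overlap or touch as one group. Note that any interval-based approach reads each row as a contiguous range from its first to its last value, so rows such as "1,5" and "2,3" would be grouped even though they share no value.

```r
# Hypothetical helper: one group id per interval; intervals that overlap or
# touch share an id. Assumes start <= end for every interval.
interval_groups <- function(start, end) {
  ord <- order(start, end)
  s <- start[ord]
  e <- end[ord]
  # Largest end among all earlier intervals (in sorted order)
  prev_max_end <- c(-Inf, cummax(e)[-length(e)])
  id_sorted <- cumsum(s > prev_max_end)  # new group when there is a gap
  id <- integer(length(start))
  id[ord] <- id_sorted                   # back to the original row order
  id
}

# First and last values from step (1)
start <- c(1, 1, 3, 5, 5, 6, 8, 8, 9, 11, 11, 13, 15, 15, 16, 18, 18, 19)
end   <- c(3, 4, 4, 6, 7, 7, 9, 10, 10, 13, 14, 14, 16, 17, 17, 19, 20, 20)

id <- interval_groups(start, end)
merged <- cbind(start = as.vector(tapply(start, id, min)),
                end   = as.vector(tapply(end, id, max)))
merged
```

For this data the merged bounds are the same six start/end pairs that appear in the iv_groups() result above, and `id` can also be used to pick the first original row of each group with `!duplicated(id)`.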

(3) It seems like your desired output is a character vector, where all integer sequences of the remaining intervals are pasted together.

To achieve this, we use ivs::iv_start() and ivs::iv_end() to access the boundaries of each interval. Now we would like to generate regular sequences. Unfortunately, : is not vectorised, hence we introduce:

seq2str = Vectorize(\(from, to) toString(from:to))

toString() is a helper function for format(); its help page states:

The default method first converts x to character and then concatenates the elements separated by ", ".

Applying our custom function to each start and end gives

> seq2str(from = iv_start(z), to = iv_end(z) - 1)
[1] "1, 2, 3"    "5, 6"       "8, 9"       "11, 12, 13" "15, 16"     "18, 19"
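Vectorize() builds its result on top of mapply(), so the same thing can be written by calling mapply() directly. A small sketch with made-up `from` and `to` vectors, showing that both forms agree:

```r
from <- c(1, 5, 8)
to   <- c(3, 6, 9)

# Version from the answer: a vectorised wrapper around from:to
seq2str <- Vectorize(\(from, to) toString(from:to))
a <- seq2str(from, to)

# Equivalent direct call, iterating over from and to in parallel
b <- mapply(\(from, to) toString(from:to), from, to)

a
identical(a, b)
```

Which one to use is mostly a matter of taste: Vectorize() gives a reusable function, while mapply() avoids defining a named helper.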

Note

You can also use this shorter version.

# input data 
overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", "5,6", "5,6,7", "6,7", "8,9", 
            "8,9,10", "9,10", "11,12,13", "11,12,13", "11,12,13,14", "13,14",
            "15,16", "15,16,17", "16,17", "18,19", "18,19,20", "19,20")
# piped version
library(ivs)
overlap |> 
  strsplit(",") |> 
  lapply(\(x) x[c(1, length(x))]) |> 
  do.call(what="rbind") |>
  type.convert(as.is=TRUE) |>
  list(. = _) |>
  with(iv(.[, 1], .[, 2])) |>
  iv_groups() |>
  list(. = _) |>
  with(Vectorize(\(from, to) toString(from:to))(iv_start(.), to = iv_end(.) - 1))

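where wrapping the piped value in list(. = _) and evaluating the next step with with() is a trick that lets us refer to that value as . more than once while still using base R's native pipe.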
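The trick is easier to see in isolation. The native pipe's `_` placeholder (available from R 4.2) can appear only once per call and only as a named argument, so the value is first stored in a one-element list under the name `.` and then with() evaluates an expression in which `.` can be used freely. A toy sketch, with a made-up matrix `m`:

```r
m <- cbind(c(1, 5), c(3, 7))   # toy 2-column matrix of starts and ends

# list(. = _) + with(): refer to the piped value twice as `.`
widths <- m |>
  list(. = _) |>
  with(.[, 2] - .[, 1])

# Alternative without the placeholder: pipe into an anonymous function
widths2 <- m |>
  (\(.) .[, 2] - .[, 1])()

widths
```

Both forms give the same result; the anonymous-function form also works in R 4.1, where the `_` placeholder is not yet available.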
