Selecting the maximum number of values from a column that share values in another column in R

Question

I have a large dataset with sites that were sampled irregularly over 40 years. I want to select the maximum number of sites that share, let’s say, at least 5 years of data.

Any pointers would be appreciated.

Here’s an example dataset:

library(tidyverse)

set.seed(123)

DF <- tibble(
  Sites = 1:100,
  NYears = rbinom(100, 40, .2)
  ) %>%
  rowwise() %>%
  mutate(Years = list(sample(1982:2021, NYears))) %>%
  unnest(Years) %>%
  select(-NYears)

ThomasIsCoding · Accepted Answer

Disclaimer

The approach below is NOT as efficient as the solution by @jblood94 (so, DON'T use my solution for large dataset if you are pursuing the speed), but just to change the mindset by thinking in the graph-theory way and explore the possibility of using igraph to solve the problem.

Brief Idea (Graph Theory Perspective)

Generally speaking, I think this question can be processed in the graph theory manner and solved by igraph. If you are pursuing the efficiency, you probably need to explore the potential properties hidden behind the graph. For example:

The number of shared Years can be interpreted as the edge weight that is associated with two Sites vertices.
Also, the graph can be further simplified since some edges with weight <=4 can be skipped when searching the cliques. Pruning network and searching afterwards should be more efficient than iterating over all possible combinations.

If you are interested in the details, please refer to the the follow-up answer with code breakdowns.

One `igraph` Approach

Below might be one igraph option to solve the problem (see comments to the code for details): You can try graph_from_adjacency_matrix on Sites and find the cliques using cliques(), e.g.,

res <- DF %>%
    table() %>%
    tcrossprod() %>%
    # build a graph based on the adjacency matrix of `Sites`, where the "weight" attribute denotes the number of shared `Years`
    graph_from_adjacency_matrix(
        "undirected",
        diag = FALSE,
        weighted = TRUE
    ) %>%
    # prune the graph by keeping only the arcs that meet the condition, i.e., >= 5 (share at least 5 years of data)
    subgraph.edges(E(.)[E(.)$weight > 4]) %>%
    # find all cliques
    cliques(min = 2) %>%
    # double check if `Sites` in each clique meet the condition, using full info from `DF`
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) > 4
        }, .
    ) %>%
    # pick the clique that consists of the maximum number of `Sites`
    `[`(lengths(.) == max(lengths(.)))

or an alternative

res <- DF %>%
    table() %>%
    tcrossprod() %>%
    `>=`(5) %>%
    graph_from_adjacency_matrix(mode = "undirected", diag = FALSE) %>%
    # find all cliques
    cliques(min = 2) %>%
    # double check if `Sites` in each clique meet the condition, using full info from `DF`
    Filter(
        \(q) {
            sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= 5
        }, .
    ) %>%
    # pick the clique that consists of the maximum number of `Sites`
    `[`(lengths(.) == max(lengths(.)))

which gives

> res
[[1]]
+ 3/57 vertices, named, from d7ac134:
[1] 31 59 67

[[2]]
+ 3/57 vertices, named, from d7ac134:
[1] 26 53 84

If you want to further show the shared years, you can take additional actions on top of res, e.g.,

lapply(
    res,
    \(q) {
        list(
            sites = as.integer(names(q)),
            sharedYears = as.integer(names(which(table(with(DF, Years[Sites %in% names(q)])) == length(q))))
        )
    }
)

which gives

[[1]]
[[1]]$sites
[1] 31 59 67

[[1]]$shared_years
[1] 1989 1992 1999 2002 2005


[[2]]
[[2]]$sites
[1] 26 53 84

[[2]]$shared_years
[1] 1989 1991 1998 2001 2011

Discussion

In the igraph options above, cliques() would be the performance bottleneck especially when the condition is "the number of shared Years should be >=k" for small ks, e.g., k=1 or k=2. In those cases, there are significantly more cliques to be enumerated by cliques() before Filter(). You can refer to the benchmarking results by @jblood94.

Selecting the maximum number of values from a column that share values in another column in R

Tags:

performance

algorithm

dataframe

r

igraph

mferreira

1 Answers

Disclaimer

Brief Idea (Graph Theory Perspective)

One `igraph` Approach

Discussion

ThomasIsCoding

Recent Activity

Donate For Us

Selecting the maximum number of values from a column that share values in another column in R

Tags:

performance

algorithm

dataframe

r

igraph

mferreira

1 Answers

Disclaimer

Brief Idea (Graph Theory Perspective)

One igraph Approach

Discussion

ThomasIsCoding

Related questions

Recent Activity

Donate For Us

One `igraph` Approach