I have a large dataset with sites that were sampled irregularly over 40 years. I want to select the maximum number of sites that share, let’s say, at least 5 years of data.
Any pointers would be appreciated.
Here’s an example dataset:
library(tidyverse)
set.seed(123)
DF <- tibble(
Sites = 1:100,
NYears = rbinom(100, 40, .2)
) %>%
rowwise() %>%
mutate(Years = list(sample(1982:2021, NYears))) %>%
unnest(Years) %>%
select(-NYears)
The approach below is NOT as efficient as the solution by @jblood94 (so, DON'T use my solution for large dataset if you are pursuing the speed), but just to change the mindset by thinking in the graph-theory way and explore the possibility of using igraph to solve the problem.
Generally speaking, I think this question can be processed in the graph theory manner and solved by igraph. If you are pursuing the efficiency, you probably need to explore the potential properties hidden behind the graph. For example:
Years can be interpreted as the edge weight that is associated with two Sites vertices.<=4 can be skipped when searching the cliques. Pruning network and searching afterwards should be more efficient than iterating over all possible combinations.If you are interested in the details, please refer to the the follow-up answer with code breakdowns.
igraph ApproachBelow might be one igraph option to solve the problem (see comments to the code for details): You can try graph_from_adjacency_matrix on Sites and find the cliques using cliques(), e.g.,
res <- DF %>%
table() %>%
tcrossprod() %>%
# build a graph based on the adjacency matrix of `Sites`, where the "weight" attribute denotes the number of shared `Years`
graph_from_adjacency_matrix(
"undirected",
diag = FALSE,
weighted = TRUE
) %>%
# prune the graph by keeping only the arcs that meet the condition, i.e., >= 5 (share at least 5 years of data)
subgraph.edges(E(.)[E(.)$weight > 4]) %>%
# find all cliques
cliques(min = 2) %>%
# double check if `Sites` in each clique meet the condition, using full info from `DF`
Filter(
\(q) {
sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) > 4
}, .
) %>%
# pick the clique that consists of the maximum number of `Sites`
`[`(lengths(.) == max(lengths(.)))
or an alternative
res <- DF %>%
table() %>%
tcrossprod() %>%
`>=`(5) %>%
graph_from_adjacency_matrix(mode = "undirected", diag = FALSE) %>%
# find all cliques
cliques(min = 2) %>%
# double check if `Sites` in each clique meet the condition, using full info from `DF`
Filter(
\(q) {
sum(table(with(DF, Years[Sites %in% names(q)])) == length(q)) >= 5
}, .
) %>%
# pick the clique that consists of the maximum number of `Sites`
`[`(lengths(.) == max(lengths(.)))
which gives
> res
[[1]]
+ 3/57 vertices, named, from d7ac134:
[1] 31 59 67
[[2]]
+ 3/57 vertices, named, from d7ac134:
[1] 26 53 84
If you want to further show the shared years, you can take additional actions on top of res, e.g.,
lapply(
res,
\(q) {
list(
sites = as.integer(names(q)),
sharedYears = as.integer(names(which(table(with(DF, Years[Sites %in% names(q)])) == length(q))))
)
}
)
which gives
[[1]]
[[1]]$sites
[1] 31 59 67
[[1]]$shared_years
[1] 1989 1992 1999 2002 2005
[[2]]
[[2]]$sites
[1] 26 53 84
[[2]]$shared_years
[1] 1989 1991 1998 2001 2011
In the igraph options above, cliques() would be the performance bottleneck especially when the condition is "the number of shared Years should be >=k" for small ks, e.g., k=1 or k=2. In those cases, there are significantly more cliques to be enumerated by cliques() before Filter(). You can refer to the benchmarking results by @jblood94.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With