Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Union of intersecting vectors in a list in R

I have a list of vectors as follows.

data <- list(v1=c("a", "b", "c"), v2=c("g", "h", "k"), 
             v3=c("c", "d"), v4=c("n", "a"), v5=c("h", "i"))

I am trying to achieve the following:

  1. Check whether any of the vectors intersect with each other
  2. If intersecting vectors are found, get their union

So the desired output is

out <- list(v1=c("a", "b", "c", "d", "n"), v2=c("g", "h", "k", "i"))

I can get the union of a group of intersecting sets as follows.

 Reduce(union, list(data[[1]], data[[3]], data[[4]]))
 Reduce(union, list(data[[2]], data[[5]])

How to first identify the intersecting vectors? Is there a way of dividing the list into lists of groups of intersecting vectors?

#Update

Here is an attempt using data.table. Gets the desired results. But still slow for large lists as in this example dataset.

datasets. 
data <- sapply(data, function(x) paste(x, collapse=", "))
data <- as.data.frame(data, stringsAsFactors = F)

repeat {
  M <- nrow(data)
  data <- data.table( data , key = "data" )
  data <- data[ , list(dataelement = unique(unlist(strsplit(data , ", " )))), by = list(data)]
  data <- data.table(data , key = "dataelement" )
  data <- data[, list(data = paste0(sort(unique(unlist(strsplit(data, split=", ")))), collapse=", ")), by = "dataelement"]
  data$dataelement <- NULL
  data <- unique(data)
  N <- nrow(data)
  if (M == N)
    break
}

data <- strsplit(as.character(data$data) , "," )
like image 337
Crops Avatar asked Dec 17 '14 07:12

Crops


2 Answers

This is kind of like a graph problem so I like to use the igraph library for this, using your sample data, you can do

library(igraph)
#build edgelist
el <- do.call("rbind",lapply(data, embed, 2))
#make a graph
gg <- graph.edgelist(el, directed=F)
#partition the graph into disjoint sets
split(V(gg)$name, clusters(gg)$membership)

# $`1`
# [1] "b" "a" "c" "d" "n"
# 
# $`2`
# [1] "h" "g" "k" "i"

And we can view the results with

V(gg)$color=c("green","purple")[clusters(gg)$membership]
plot(gg)

enter image description here

like image 177
MrFlick Avatar answered Sep 23 '22 23:09

MrFlick


Here's another approach using only base R

Update

Next update after akrun's comment and with his sample data:

data <- list(v1=c('g', 'k'), v2= letters[1:4], v3= c('b', 'c', 'd', 'a'))

Modified function:

x <- lapply(seq_along(data), function(i) {
  if(!any(data[[i]] %in% unlist(data[-i]))) {
    data[[i]]
  } else if (any(data[[i]] %in% unlist(data[seq_len(i-1)]))) {
    NULL 
  } else {
    z <- lapply(data[-seq_len(i)], intersect,  data[[i]]) 
    z <- names(z[sapply(z, length) >= 1L])
    if (is.null(z)) NULL else union(data[[i]], unlist(data[z]))
  }
})
x[!sapply(x, is.null)]
#[[1]]
#[1] "g" "k"
#
#[[2]]
#[1] "a" "b" "c" "d"

This works well with the original sample data, MrFlick's sample data and akrun's sample data.

like image 21
talat Avatar answered Sep 22 '22 23:09

talat