Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Check if list contains another list in R

Tags:

list

r

subset

I want to check if a list (or a vector, equivalently) is contained into another one, not if it is a subset of thereof. Let us assume we have

r <- c(1,1)
s <- c(5,2)
t <- c(1,2,5)

The function should behave as follows:

is.contained(r,t) 
[1] FALSE
# as (1,1) is not contained in (1,2,5) since the former 
# contains two 1 whereas the latter only one.

is.contained(s,t)
[1] TRUE

The operator %in% checks for subsets, hence it would return TRUE in both cases, likewise all or any. I am sure there is a one-liner but I just do not see it.

like image 370
gented Avatar asked Sep 05 '16 22:09

gented


Video Answer


2 Answers

How about using a loop. I iterate over the first vector and check if it is present in the second vector. If it is there i remove it from second vector. And the process continues.

is.contained=function(vec1,vec2){
    x=vector(length = length(vec1))
    for (i in 1:length(vec1)) {
        x[i] = vec1[i] %in% vec2
        if(length(which(vec1[i] %in% vec2)) == 0) vec2 else 
            vec2=vec2[-match(vec1[i], vec2)]
    }
    y=all(x==T)
    return(y)
}
like image 142
Chirayu Chamoli Avatar answered Oct 10 '22 07:10

Chirayu Chamoli


The sets functions (e.g. intersect, union, etc.) from base R give results consistent with set theory. Sets technically don't have repeating elements, thus the vector c(1,1,2) and c(1,2) are considered the same when it comes to sets (see Set (Mathematics)). This is the main problem this question faces and thus why some of the solutions posted here fail (including my previous attempts). The solution to the OP's question is found somewhere between understanding sets and sequences. Although sequences allow repetition, order matters, and here we don't care (order doesn't matter in sets).

Below, I have provided a vector intersect function (VectorIntersect) that returns all of the common elements between two vectors regardless of order or presence of duplicates. Also provided is a containment function called is.contained, which calls VectorIntersect, that will determine if all of the elements in one vector are present in another vector.

VectorIntersect <- function(v,z) {
    unlist(lapply(unique(v[v%in%z]), function(x) rep(x,min(sum(v==x),sum(z==x)))))
}
is.contained <- function(v,z) {length(VectorIntersect(v,z))==length(v)}

Let's look at a simple example:

r <- c(1, 1)
s <- c(rep(1, 5), rep("a", 5))
s
[1] "1" "1" "1" "1" "1" "a" "a" "a" "a" "a"

VectorIntersect(r, s)
[1] 1 1
is.contained(r, s)  ## r is contained in s
[1] TRUE
is.contained(s, r)  ## s is not contained in r
[1] FALSE
is.contained(s, s)  ## s is contained in itself.. more on this later
[1] TRUE

Now, let's look at @Gennaro's clever recursive approach which gives correct results (Many apologies and also many Kudos... on earlier tests, I was under the impression that it was checking to see if b was contained in s and not the other way around):

fun.contains(s, r)  ## s contains r
[1] TRUE
fun.contains(r, s)  ## r does not contain s
[1] FALSE
fun.contains(s, s)  ## s contains s
[1] TRUE

We will now step through the other set-based algorithms and see how they handle r and s above. I have added print statements to each function for clarity. First, @Jilber's function.

is.containedJilber <- function(x,y){
    temp <- intersect(x,y)
    print(temp); print(length(x)); print(length(temp)); print(all.equal(x, temp))
    out <- ifelse(length(x)==length(temp), all.equal(x, temp), FALSE)
    return(out)  
}

is.containedJilber(r, s)   ## should return TRUE but does not
[1] "1"                    ## result of intersect
[1] 2                      ## length of r
[1] 1                      ## length of temp

## results from all.equal.. gives weird results because lengths are different
[1] "Modes: numeric, character"    "Lengths: 2, 1"     "target is numeric, current is character"

[1] FALSE                  ## results from the fact that 2 does not equal 1

is.containedJilber(s, s)    ## should return TRUE but does not
[1] "1" "a"                 ## result of intersect
[1] 10                      ## length of s
[1] 2                       ## length of temp

## results from all.equal.. again, gives weird results because lengths are different
[1] "Lengths (10, 2) differ (string compare on first 2)" "1 string mismatch" 

[1] FALSE                  ## results from the fact that 10 does not equal 2

Here is @Simon's:

is.containedSimon <- function(x, y) {
    print(setdiff(x, y))
    z <- x[x %in%setdiff(x, y)]
    print(z); print(length(x)); print(length(y)); print(length(z))
    length(z) == length(x) - length(y)
}

is.containedSimon(s, r)    ## should return TRUE but does not
[1] "a"                    ## result of setdiff
[1] "a" "a" "a" "a" "a"    ## the elements in s that match the result of setdiff
[1] 10                     ## length of s
[1] 2                      ## length of r
[1] 5                      ## length of z
[1] FALSE                  ## result of 5 not being equal to 10 - 2

Hopefully this illustrates the pitfalls of applying strict set operations in this setting.

Let's test for efficiency and equality. Below, we build many test vectors and check to see if they are contained in either the vector testContainsNum (if it's a number vector) or testContainsAlpha (if it is a character vector):

set.seed(123)
testContainsNum <- sample(20:40, 145, replace=TRUE)       ## generate large vector with random numbers
testContainsAlpha <- sample(letters, 175, replace=TRUE)   ## generate large vector with random letters

tVec <- lapply(1:1000, function(x) {   ## generating test data.. 
    if (x%%2==0) {
        sample(20:40, sample(50:100, 1), replace=TRUE)  ## even indices will contain numbers
    } else {
        sample(letters, sample(50:90, 1), replace=TRUE)  ## odd indices will contain characters
    }
})

tContains <- lapply(1:1000, function(x) if (x%%2==0) {testContainsNum} else {testContainsAlpha})

## First check equality
tJoe <- mapply(is.contained, tVec, tContains)
tGennaro <- mapply(fun.contains, tContains, tVec)
tSimon <- mapply(is.containedSimon, tContains, tVec)
tJilber <- mapply(is.containedJilber, tVec, tContains)

all(tJoe==tGennaro)  ## Give same results
[1] TRUE

## Both Jilber's and Simon's solution don't return any TRUE values
any(tJilber)
[1] FALSE
any(tSimon)
[1] FALSE

## There should be 170 TRUEs
sum(tJoe)
[1] 170

Let's take a closer look to determine if is.contained and fun.contains are behaving correctly.

table(tVec[[3]])
a  b  c     e  f  g  h  i  j  k  l  m  n  o  p  q  r     t  u  v  w  x  y  z 
3  4  5     2  2  1  5  3  5  3  2  1  7  3  1  2  4     3  5  5  2  4  3  3

table(tContains[[3]])
a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z 
4 11  4  3  7  8 13  4  4  9 13  3 10  7  7  4  8  7  8  6  7  5  9  4  4  6

## Note above that tVec[[3]] has 1 more c and h than tContains[[3]],
## thus tVec[[3]] is not contained in tContains[[3]]
c(tJoe[3], tGennaro[3])  
[1] FALSE FALSE           ## This is correct!!!!


table(tVec[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 
 6  4  4  7  6  3  4  6  3  5  4  4  6  4  4  2  2  5  3  1  4 

table(tContains[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 
 6  4 10  7  6  4 10  6  8 10  5  5  6  9  8  5  7  5 11  4  9 

## Note above that every element in tVec[[14]] is present in 
## tContains[[14]] and the number of occurences is less than or
## equal to the occurences in tContains[[14]]. Thus, tVec[[14]]
## is contained in tContains[[14]]
c(tJoe[14], tGennaro[14])
[1] TRUE TRUE            ## This is correct!!!!

Here are the benchmarks:

library(microbenchmark)
microbenchmark(Joe = mapply(is.contained, tVec, tContains),
               Gennaro = mapply(fun.contains, tContains, tVec))
Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval cld
    Joe 165.0310 172.7817 181.3722 178.7014 187.0826 230.2806   100  a 
Gennaro 249.8495 265.4022 279.0866 273.5923 288.1159 336.8464   100   b

Side Note about VectorIntersect()
After spending a good bit of time with this problem, it became increasingly clear that separating VectorIntersect from is.contained is tremendously valuable. I know many times in my own work, obtaining the intersection without duplicates being removed surfaced frequently. Oftentimes, the method implemented was messy and probably not reliable (easy to see why after this!). This is why VectorIntersect is a great utility function in additon to is.contained.


Update

Actually @Gennaro's solution can be improved quite a bit by calculating s[duplicated(s)] only one time as opposed to 3 times (similarly for b and length(s), we only calculate them once vs 2 times).

fun.containsFAST <- function(b, s){
    dupS <- s[duplicated(s)]; dupB <- b[duplicated(b)]
    lenS <- length(dupS)
    all(s %in% b) && lenS <= length(dupB) &&
        (if(lenS>0) fun.containsFAST(dupB,dupS) else 1)
}

microbenchmark(Joe = mapply(is.contained, tVec, tContains),
               GenFAST = mapply(fun.containsFAST, tContains, tVec),
               Gennaro = mapply(fun.contains, tContains, tVec))
Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval cld
    Joe 163.3529 172.1050 182.3634 177.2324 184.9622 293.8185   100  b 
GenFAST 145.3982 157.7183 169.3290 164.7898 173.4063 277.1561   100 a  
Gennaro 243.2416 265.8270 281.1472 273.5323 284.8820 403.7249   100   c


Update 2

What about testing containment for really big vectors? The function I provided is not likely to perform well as building the "intersection" (with duplicates etc.) by essentially looping over the true set intersection isn't very efficient. The modified @Gennaro's function won't be fast as well, because for very large vectors with many duplicates, the function calls could get nested pretty deep. With this in mind, I built yet another containment function that is specifically built for dealing with large vectors. It utilizes vectorized base R functions, especially of note pmin.int, which returns the parallel minima of multiple vectors. The inner function myL is taken directly from the guts of the rle function in base R (although slightly modified for this specific use).

is.containedBIG <- function(v, z) { ## v and z must be sorted
    myL <- function(x) {LX <- length(x); diff(c(0L, which(x[-1L] != x[-LX]), LX))}
    sum(pmin.int(myL(v[v %in% z]), myL(z[z %in% v])))==length(v)
}

Note that on smaller exmaples is.contained and fun.containsFAST are faster (this is mostly due to the time it takes to repeatedly sort.. as you will see, if the data is sorted is.containedBIG is much faster). Observe (for thoroughness we will also show the verification of @Chirayu's function and test's its efficiency):

## we are using tVec and tContains as defined above in the original test
tChirayu <- mapply(is.containedChirayu, tVec, tContains)
tJoeBIG <- sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]])))
all(tChirayu==tJoe)  ## @Chirayu's returns correct results
[1] TRUE
all(tJoeBIG==tJoe)   ## specialized alogrithm returns correct results
[1] TRUE

microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVec[[x]], tContains[[x]])),
               JoeBIG=sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]]))),
               GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContains[[x]], tVec[[x]])),
               Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVec[[x]], tContains[[x]])))
Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval cld
    Joe  154.6158  165.5861  176.3019  175.4786  180.1299  313.7974   100 a  
 JoeBIG  269.1460  282.9347  294.1568  289.0174  299.4687  445.5222   100  b   ## about 2x as slow as GenFAST
GenFAST  140.8219  150.5530  156.2019  155.8306  162.0420  178.7837   100 a  
Chirayu 1213.8962 1238.5666 1305.5392 1256.7044 1294.5307 2619.5370   100   c  ## about 8x as slow as GenFAST

Now, with sorted data, the results are quite astonishing. is.containedBIG shows a 3 fold improvement in speed whereas the other functions return almost identical timings.

## with pre-sorted data
tVecSort <- lapply(tVec, sort)
tContainsSort <- lapply(tContains, sort)
microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVecSort[[x]], tContainsSort[[x]])),
               JoeBIG=sapply(1:1000, function(x) is.containedBIG(tVecSort[[x]], tContainsSort[[x]])),
               GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContainsSort[[x]], tVecSort[[x]])),
               Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVecSort[[x]], tContainsSort[[x]])))
Unit: milliseconds
   expr        min         lq       mean     median         uq       max neval  cld
    Joe  154.74771  166.46926  173.45399  172.92374  177.09029  297.7758   100   c 
 JoeBIG   83.69259   87.35881   94.48476   92.07183   98.37235  221.6014   100 a  ## now it's the fastest
GenFAST  139.19631  151.23654  159.18670  157.05911  162.85636  275.7158   100  b  
Chirayu 1194.15362 1241.38823 1280.10058 1260.09439 1297.44847 1454.9805   100    d

For very large vectors, we have the following (only showing GenFAST and JoeBIG as the other functions will take too long):

set.seed(97)
randS <- sample(10^9, 8.5*10^5)
testBigNum <- sample(randS, 2*10^7, replace = TRUE)
tVecBigNum <- lapply(1:20, function(x) {
    sample(randS, sample(1500000:2500000, 1), replace=TRUE)
})

system.time(tJoeBigNum <- sapply(1:20, function(x) is.containedBIG(sort(tVecBigNum[[x]]), sort(testBigNum))))
   user  system elapsed 
 74.016  11.351  85.409 
system.time(tGennaroBigNum <- sapply(1:20, function(x) fun.containsFAST(testBigNum, tVecBigNum[[x]])))
   user  system elapsed 
662.875  54.238 720.433
sum(tJoeBigNum)
[1] 13
all(tJoeBigNum==tGennaroBigNum)
[1] TRUE

## pre-sorted data
testBigSort <- sort(testBigNum)
tVecBigSort <- lapply(tVecBigNum, sort)
system.time(tJoeBigSort <- sapply(1:20, function(x) is.containedBIG(tVecBigSort[[x]], testBigSort)))
   user  system elapsed 
 33.910  10.302  44.289
system.time(tGennaroBigSort <- sapply(1:20, function(x) fun.containsFAST(testBigSort, tVecBigSort[[x]])))
   user  system elapsed 
196.546  54.923 258.079
identical(tJoeBigSort, tGennaroBigSort, tJoeBigNum)
[1] TRUE

Regardless if your data is sorted or not, the point of this last test is to show that is.containedBIG is much faster on larger data. An interesting take away from this last test was the fact that fun.containsFAST showed a very large improvement in time when the data was sorted. I was under the impression that duplicated (which is the workhorse of fun.containsFAST), did not depend on whether a vector was sorted or not. Earlier test confirmed this sentiment (the unsorted test timings were practically identical to the sorted test timings (see above)). More research is needed.

like image 29
Joseph Wood Avatar answered Oct 10 '22 09:10

Joseph Wood