Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Any faster way to check if lists in a list are equivalent?

Here I have integers 1:7 for four different partitions, i.e., {1}, {2,3,4}, {5,6}, and {7} and those partitions are written in a list, i.e., list(1,c(2,3,4),c(5,6),7). I treat the partitions as sets, such that different permutation of elements within one partition should be recognized as the same one. For example, list(1,c(2,3,4),c(5,6),7) and list(7,1,c(2,3,4),c(6,5)) are equivalent.

Note that, there is no repetition for elements in list, e.g., no list(c(1,2),c(2,1),c(1,2)), since this problem is discussing exclusive partitions over the whole set.

I listed some of the different permutations into the list lst as below

lst <- list(list(1,c(2,3,4),c(5,6),7),
            list(c(2,3,4),1,7,c(5,6)),
            list(1,c(2,3,4),7,c(6,5)),
            list(7,1,c(3,2,4),c(5,6)))

and what I want to do is to verify all permutations are equivalent. If yes, then we get result TRUE.

What I did so far is to sort the elements within each partition, and used setdiff() with interset() and union() to judge it (see my code below)

s <- Map(function(v) Map(sort,v),lst)
equivalent <- length(setdiff(Reduce(union,s),Reduce(intersect,s),))==0

However, I guess this method would be slow whenever the partition size scales up. Is there any faster approach to make it? Appreciated in advance!

  • some test cases (small size data)
# should return `TRUE`
lst1 <- list(list(1,c(2,3,4),c(5,6)),
            list(c(2,3,4),1,c(5,6)),
            list(1,c(2,3,4),c(6,5)))

# should return `TRUE`
lst2 <- list(list(1:2, 3:4), list(3:4, 1:2))

# should return `FALSE`
lst3 <- list(list(1,c(2,3,4),c(5,6)), list(c(2,3,4),1,c(5,6)), list(1,c(2,3,5),c(6,4)))
like image 223
ThomasIsCoding Avatar asked Dec 10 '19 15:12

ThomasIsCoding


People also ask

How do you compare two lists to equal?

For the two lists to be equal, each element of the first list should be equal to the second list's corresponding element. If the two lists have the same elements, but the sequence is not the same, they will not be considered equal or identical lists.

How do you check if all values in a list are equal?

You can convert the list to a set. A set cannot have duplicates. So if all the elements in the original list are identical, the set will have just one element. if len(set(input_list)) == 1: # input_list has all identical elements.

Can you compare lists with ==?

Python sort() method and == operator to compare lists Further, the == operator is used to compare the list, element by element.

How do you check for the same values in a list?

A: Select the list you want to check to the Find values in textbox, and select the list you want to check based on into the According to textbox. B: check Each row in the Based on section. C: check Same values in Find section.

How to check if all the elements in a list are equal?

The program will print 'Equal' if all elements are the same else it will print 'Not Equal'. This example uses a built-in Python set datatype. Set does not allow duplicate elements. It also requires all of your elements to be hashable. Using this property of set, we can check whether all the elements in a list are the same or not.

How to compare two lists in SQL?

Method 1: Compare Two Lists Using Equal Sign Operator Method 2: Match Data by Using Row Difference Technique Method 3: Match Row Difference by Using IF Condition Method 4: Match Data Even If There is a Row Difference

What is the fastest way to check if a list intersects?

The fastest way is to hash the list you want to check, and then check each item in there. This is why Joe Koberg's answer is fast: checking set intersection is very fast. When you don't have a lot of data though, making sets can be a waste of time.


Video Answer


2 Answers

A post about R and any variant of fast isn't complete without a solution featuring rcpp.

To maximize efficiency, picking the correct data structure will be of utmost importance. Our data structure needs to store unique values and also have fast insert/access. This is exactly what std::unordered_set embodies. We only need to determine how we can uniquely identify each vector of unordered integers.

Enter the Fundamental Theorem of Arithmetic

The FTA states that every number can be uniquely represented (up to the order of the factors) by the product of primes numbers.

Here is an example demonstrating how we can use the FTA to quickly decipher if two vectors are equivalent up to order (N.B. P below is a list of primes numbers... (2, 3, 5, 7, 11, etc.):

                   Maps to                    Maps to              product
vec1 = (1, 2, 7)    -->>    P[1], P[2], P[7]   --->>   2,  3, 17     -->>   102
vec2 = (7, 3, 1)    -->>    P[7], P[3], P[1]   --->>  17,  5,  2     -->>   170
vec3 = (2, 7, 1)    -->>    P[2], P[7], P[1]   --->>   3, 17,  2     -->>   102

From this, we see that vec1 and vec3 correctly map to the same number, whereas vec2 gets mapped to a different value.

Since our actual vectors may contain up to a hundred integers less than 1000, applying the FTA will yield extremely large numbers. We can get around this by taking advantage of the product rule of logarithm:

logb(xy) = logb(x) + logb(y)

With this at our disposal, we will be able to tackle much larger numbers example (This starts deteriorating on extremely large examples).

First, we need a simple prime number generator (N.B. We are actually generating the log of each prime number).

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

void getNPrimes(std::vector<double> &logPrimes) {
    
    const int n = logPrimes.size();
    const int limit = static_cast<int>(2.0 * static_cast<double>(n) * std::log(n));
    std::vector<bool> sieve(limit + 1, true);
    
    int lastP = 3;
    const int fsqr = std::sqrt(static_cast<double>(limit));
    
    while (lastP <= fsqr) {
        for (int j = lastP * lastP; j <= limit; j += 2 * lastP)
            sieve[j] = false;
        
        int ind = 2;
        
        for (int k = lastP + 2; !sieve[k]; k += 2)
            ind += 2;
        
        lastP += ind;
    }
    
    logPrimes[0] = std::log(2.0);
    
    for (int i = 3, j = 1; i <= limit && j < n; i += 2)
        if (sieve[i])
            logPrimes[j++] = std::log(static_cast<double>(i));
}

And here is the main implementation:

// [[Rcpp::export]]
bool f_Rcpp_Hash(List x) {
    
    List tempLst = x[0];
    const int n = tempLst.length();
    int myMax = 0;

    // Find the max so we know how many primes to generate
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        const int tempMax = *std::max_element(v.cbegin(), v.cend());
        
        if (tempMax > myMax)
            myMax = tempMax;
    }
    
    std::vector<double> logPrimes(myMax + 1, 0.0);
    getNPrimes(logPrimes);
    double sumMax = 0.0;
    
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        double mySum = 0.0;
        
        for (auto j: v)
            mySum += logPrimes[j];
        
        if (mySum > sumMax)
            sumMax = mySum;
    }

    // Since all of the sums will be double values and we want to
    // ensure that they are compared with scrutiny, we multiply
    // each sum by a very large integer to bring the decimals to
    // the right of the zero and then convert them to an integer.
    // E.g. Using the example above v1 = (1, 2, 7) & v2 = (7, 3, 1)
    //              
    //    sum of log of primes for v1 = log(2) + log(3) + log(17)
    //                               ~= 4.62497281328427
    //
    //    sum of log of primes for v2 = log(17) + log(5) + log(2)
    //                               ~= 5.13579843705026
    //    
    //    multiplier = floor(.Machine$integer.max / 5.13579843705026)
    //    [1] 418140173
    //    
    // Now, we multiply each sum and convert to an integer
    //    
    //    as.integer(4.62497281328427 * 418140173)
    //    [1] 1933886932    <<--   This is the key for v1
    //
    //    as.integer(5.13579843705026 * 418140173)
    //    [1] 2147483646    <<--   This is the key for v2
    
    const uint64_t multiplier = std::numeric_limits<int>::max() / sumMax;
    std::unordered_set<uint64_t> canon;
    canon.reserve(n);
    
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        double mySum = 0.0;
        
        for (auto j: v)
            mySum += logPrimes[j];
        
        canon.insert(static_cast<uint64_t>(multiplier * mySum));
    }
    
    const auto myEnd = canon.end();
    
    for (auto it = x.begin() + 1; it != x.end(); ++it) {
        List tempLst = *it;
        
        if (tempLst.length() != n)
            return false;

        for (int j = 0; j < n; ++j) {
            IntegerVector v = tempLst[j];
            double mySum = 0.0;
            
            for (auto k: v)
                mySum += logPrimes[k];
            
            const uint64_t key = static_cast<uint64_t>(multiplier * mySum);
            
            if (canon.find(key) == myEnd)
                return false;
        }
    }
    
    return true;
}

Here are the results when applied to lst1, lst2, lst3, & lst (the large one) given by @GKi.

f_Rcpp_Hash(lst)
[1] TRUE

f_Rcpp_Hash(lst1)
[1] TRUE

f_Rcpp_Hash(lst2)
[1] FALSE

f_Rcpp_Hash(lst3)
[1] FALSE

And here are some benchmarks with the units parameter set to relative.

microbenchmark(check = 'equal', times = 10
               , unit = "relative"
               , f_ThomsIsCoding(lst3)
               , f_chinsoon12(lst3)
               , f_GKi_6a(lst3)
               , f_GKi_6b(lst3)
               , f_Rcpp_Hash(lst3))
Unit: relative
                 expr       min        lq      mean    median        uq       max neval
f_ThomsIsCoding(lst3) 84.882393 63.541468 55.741646 57.894564 56.732118 33.142979    10
   f_chinsoon12(lst3) 31.984571 24.320220 22.148787 22.393368 23.599284 15.211029    10
       f_GKi_6a(lst3)  7.207269  5.978577  5.431342  5.761809  5.852944  3.439283    10
       f_GKi_6b(lst3)  7.399280  5.751190  6.350720  5.484894  5.893290  8.035091    10
    f_Rcpp_Hash(lst3)  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    10


microbenchmark(check = 'equal', times = 10
               , unit = "relative"
               , f_ThomsIsCoding(lst)
               , f_chinsoon12(lst)
               , f_GKi_6a(lst)
               , f_GKi_6b(lst)
               , f_Rcpp_Hash(lst))
Unit: relative
                expr        min         lq       mean     median        uq       max neval
f_ThomsIsCoding(lst) 199.776328 202.318938 142.909407 209.422530 91.753335 85.090838    10
   f_chinsoon12(lst)   9.542780   8.983248   6.755171   9.766027  4.903246  3.834358    10
       f_GKi_6a(lst)   3.169508   3.158366   2.555443   3.731292  1.902140  1.649982    10
       f_GKi_6b(lst)   2.992992   2.943981   2.019393   3.046393  1.315166  1.069585    10
    f_Rcpp_Hash(lst)   1.000000   1.000000   1.000000   1.000000  1.000000  1.000000    10

About 3x faster than the fastest solution yet on the larger example.

What does this mean?

To me, this result speaks volumes to the beauty and efficiency of base R as displayed by @GKi, @chinsoon12, @Gregor, @ThomasIsCoding, and more. We wrote around 100 lines of very specific C++ to obtain a moderate speed up. To be fair, the base R solutions end up calling mostly compiled code and end up utilizing hash tables as we did above.

like image 63
Joseph Wood Avatar answered Nov 09 '22 09:11

Joseph Wood


After sorting you can use duplicated and all.

s <- lapply(lst, function(x) lapply(x, sort)) #Sort vectors
s <- lapply(s, function(x) x[order(vapply(x, "[", 1, 1))]) #Sort lists
all(duplicated(s)[-1]) #Test if there are all identical
#length(unique(s)) == 1 #Alternative way to test if all are identical

Alternative: Sort in one loop

s <- lapply(lst, function(x) {
  tt <- lapply(x, sort)
  tt[order(vapply(tt, "[", 1, 1))]
})
all(duplicated(s)[-1])

Alternative: Sort during loop and allow early exit

s <- lapply(lst[[1]], sort)
s <- s[order(vapply(s, "[", 1, 1))]
tt  <- TRUE
for(i in seq(lst)[-1]) {
  x <- lapply(lst[[i]], sort)
  x <- x[order(vapply(x, "[", 1, 1))]
  if(!identical(s, x)) {
    tt  <- FALSE
    break;
  }
}
tt

or using setequal

s <- lapply(lst[[1]], sort)
tt  <- TRUE
for(i in seq(lst)[-1]) {
  x <- lapply(lst[[i]], sort)
  if(!setequal(s, x)) {
    tt  <- FALSE
    break;
  }
}
tt

or improving slightly the idea from @chinsoon12 to exchange the list with a vector!

s <- lst[[1]][order(vapply(lst[[1]], min, 1))]
s <- rep(seq_along(s), lengths(s))[order(unlist(s))]
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]][order(vapply(lst[[i]], min, 1))]
  x <- rep(seq_along(x), lengths(x))[order(unlist(x))]
  if(!identical(s, x)) {tt <- FALSE; break;}
}
tt

or avoid the second order

s <- lst[[1]][order(vapply(lst[[1]], min, 1))]
s <- rep(seq_along(s), lengths(s))[order(unlist(s))]
y <- s
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]][order(vapply(lst[[i]], min, 1))]
  y <- y[0]
  y[unlist(x)] <- rep(seq_along(x), lengths(x))
  if(!identical(s, y)) {tt <- FALSE; break;}
}
tt

or exchange order with match (or fmatch)

x <- lst[[1]]
s <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
s <- match(s, unique(s))
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]]
  y <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
  y <- match(y, unique(y))
  if(!identical(s, y)) {tt <- FALSE; break;}
}
tt

Or without early exit.

s <- lapply(lst, function(x) {
  y <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
  match(y, unique(y))
})
all(duplicated(s)[-1])

or written in C++

sourceCpp(code = "#include <Rcpp.h>
#include <vector>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
bool f_GKi_6_Rcpp(const List &x) {
  const List &x0 = x[0];
  const unsigned int n = x0.length();
  unsigned int nn = 0;
  for (List const &i : x0) {nn += i.length();}
  std::vector<int> s(nn);
  for (unsigned int i=0; i<n; ++i) {
    const IntegerVector &v = x0[i];
    for (int const &j : v) {
      if(j > nn) return false;
      s[j-1] = i;
    }
  }
  {
    std::vector<int> lup(n, -1);
    int j = 0;
    for(int &i : s) {
      if(lup[i] < 0) {lup[i] = j++;}
      i = lup[i];
    }
  }
  for (List const &i : x) {
    if(i.length() != n) return false;
    std::vector<int> sx(nn);
    for(unsigned int j=0; j<n; ++j) {
      const IntegerVector &v = i[j];
      for (int const &k : v) {
        if(k > nn) return false;
        sx[k-1] = j;
      }
    }
    {
      std::vector<int> lup(n, -1);
      int j = 0;
      for(int &i : sx) {
        int &lupp = lup[i];
        if(lupp == -1) {lupp = j; i = j++;
        } else {i = lupp;}
      }
    }
    if(s!=sx) return false;
  }
  return true;
}
")

Thanks to @Gregor for hints to improve the answer!

like image 43
GKi Avatar answered Nov 09 '22 09:11

GKi