Any faster way to check if lists in a list are equivalent?

Tags:

Here I have integers 1:7 for four different partitions, i.e., {1}, {2,3,4}, {5,6}, and {7} and those partitions are written in a list, i.e., list(1,c(2,3,4),c(5,6),7). I treat the partitions as sets, such that different permutation of elements within one partition should be recognized as the same one. For example, list(1,c(2,3,4),c(5,6),7) and list(7,1,c(2,3,4),c(6,5)) are equivalent.

Note that, there is no repetition for elements in list, e.g., no list(c(1,2),c(2,1),c(1,2)), since this problem is discussing exclusive partitions over the whole set.

I listed some of the different permutations into the list lst as below

lst <- list(list(1,c(2,3,4),c(5,6),7),
            list(c(2,3,4),1,7,c(5,6)),
            list(1,c(2,3,4),7,c(6,5)),
            list(7,1,c(3,2,4),c(5,6)))

and what I want to do is to verify all permutations are equivalent. If yes, then we get result TRUE.

What I did so far is to sort the elements within each partition, and used setdiff() with interset() and union() to judge it (see my code below)

s <- Map(function(v) Map(sort,v),lst)
equivalent <- length(setdiff(Reduce(union,s),Reduce(intersect,s),))==0

However, I guess this method would be slow whenever the partition size scales up. Is there any faster approach to make it? Appreciated in advance!

some test cases (small size data)

# should return `TRUE`
lst1 <- list(list(1,c(2,3,4),c(5,6)),
            list(c(2,3,4),1,c(5,6)),
            list(1,c(2,3,4),c(6,5)))

# should return `TRUE`
lst2 <- list(list(1:2, 3:4), list(3:4, 1:2))

# should return `FALSE`
lst3 <- list(list(1,c(2,3,4),c(5,6)), list(c(2,3,4),1,c(5,6)), list(1,c(2,3,5),c(6,4)))

223

asked Dec 10 '19 15:12

ThomasIsCoding

Video Answer

2 Answers

A post about R and any variant of fast isn't complete without a solution featuring rcpp.

To maximize efficiency, picking the correct data structure will be of utmost importance. Our data structure needs to store unique values and also have fast insert/access. This is exactly what std::unordered_set embodies. We only need to determine how we can uniquely identify each vector of unordered integers.

Enter the Fundamental Theorem of Arithmetic

The FTA states that every number can be uniquely represented (up to the order of the factors) by the product of primes numbers.

Here is an example demonstrating how we can use the FTA to quickly decipher if two vectors are equivalent up to order (N.B. P below is a list of primes numbers... (2, 3, 5, 7, 11, etc.):

                   Maps to                    Maps to              product
vec1 = (1, 2, 7)    -->>    P[1], P[2], P[7]   --->>   2,  3, 17     -->>   102
vec2 = (7, 3, 1)    -->>    P[7], P[3], P[1]   --->>  17,  5,  2     -->>   170
vec3 = (2, 7, 1)    -->>    P[2], P[7], P[1]   --->>   3, 17,  2     -->>   102

From this, we see that vec1 and vec3 correctly map to the same number, whereas vec2 gets mapped to a different value.

Since our actual vectors may contain up to a hundred integers less than 1000, applying the FTA will yield extremely large numbers. We can get around this by taking advantage of the product rule of logarithm:

log_b(xy) = log_b(x) + log_b(y)

With this at our disposal, we will be able to tackle much larger numbers example (This starts deteriorating on extremely large examples).

First, we need a simple prime number generator (N.B. We are actually generating the log of each prime number).

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

void getNPrimes(std::vector<double> &logPrimes) {
    
    const int n = logPrimes.size();
    const int limit = static_cast<int>(2.0 * static_cast<double>(n) * std::log(n));
    std::vector<bool> sieve(limit + 1, true);
    
    int lastP = 3;
    const int fsqr = std::sqrt(static_cast<double>(limit));
    
    while (lastP <= fsqr) {
        for (int j = lastP * lastP; j <= limit; j += 2 * lastP)
            sieve[j] = false;
        
        int ind = 2;
        
        for (int k = lastP + 2; !sieve[k]; k += 2)
            ind += 2;
        
        lastP += ind;
    }
    
    logPrimes[0] = std::log(2.0);
    
    for (int i = 3, j = 1; i <= limit && j < n; i += 2)
        if (sieve[i])
            logPrimes[j++] = std::log(static_cast<double>(i));
}

And here is the main implementation:

// [[Rcpp::export]]
bool f_Rcpp_Hash(List x) {
    
    List tempLst = x[0];
    const int n = tempLst.length();
    int myMax = 0;

    // Find the max so we know how many primes to generate
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        const int tempMax = *std::max_element(v.cbegin(), v.cend());
        
        if (tempMax > myMax)
            myMax = tempMax;
    }
    
    std::vector<double> logPrimes(myMax + 1, 0.0);
    getNPrimes(logPrimes);
    double sumMax = 0.0;
    
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        double mySum = 0.0;
        
        for (auto j: v)
            mySum += logPrimes[j];
        
        if (mySum > sumMax)
            sumMax = mySum;
    }

    // Since all of the sums will be double values and we want to
    // ensure that they are compared with scrutiny, we multiply
    // each sum by a very large integer to bring the decimals to
    // the right of the zero and then convert them to an integer.
    // E.g. Using the example above v1 = (1, 2, 7) & v2 = (7, 3, 1)
    //              
    //    sum of log of primes for v1 = log(2) + log(3) + log(17)
    //                               ~= 4.62497281328427
    //
    //    sum of log of primes for v2 = log(17) + log(5) + log(2)
    //                               ~= 5.13579843705026
    //    
    //    multiplier = floor(.Machine$integer.max / 5.13579843705026)
    //    [1] 418140173
    //    
    // Now, we multiply each sum and convert to an integer
    //    
    //    as.integer(4.62497281328427 * 418140173)
    //    [1] 1933886932    <<--   This is the key for v1
    //
    //    as.integer(5.13579843705026 * 418140173)
    //    [1] 2147483646    <<--   This is the key for v2
    
    const uint64_t multiplier = std::numeric_limits<int>::max() / sumMax;
    std::unordered_set<uint64_t> canon;
    canon.reserve(n);
    
    for (int i = 0; i < n; ++i) {
        IntegerVector v = tempLst[i];
        double mySum = 0.0;
        
        for (auto j: v)
            mySum += logPrimes[j];
        
        canon.insert(static_cast<uint64_t>(multiplier * mySum));
    }
    
    const auto myEnd = canon.end();
    
    for (auto it = x.begin() + 1; it != x.end(); ++it) {
        List tempLst = *it;
        
        if (tempLst.length() != n)
            return false;

        for (int j = 0; j < n; ++j) {
            IntegerVector v = tempLst[j];
            double mySum = 0.0;
            
            for (auto k: v)
                mySum += logPrimes[k];
            
            const uint64_t key = static_cast<uint64_t>(multiplier * mySum);
            
            if (canon.find(key) == myEnd)
                return false;
        }
    }
    
    return true;
}

Here are the results when applied to lst1, lst2, lst3, & lst (the large one) given by @GKi.

f_Rcpp_Hash(lst)
[1] TRUE

f_Rcpp_Hash(lst1)
[1] TRUE

f_Rcpp_Hash(lst2)
[1] FALSE

f_Rcpp_Hash(lst3)
[1] FALSE

And here are some benchmarks with the units parameter set to relative.

microbenchmark(check = 'equal', times = 10
               , unit = "relative"
               , f_ThomsIsCoding(lst3)
               , f_chinsoon12(lst3)
               , f_GKi_6a(lst3)
               , f_GKi_6b(lst3)
               , f_Rcpp_Hash(lst3))
Unit: relative
                 expr       min        lq      mean    median        uq       max neval
f_ThomsIsCoding(lst3) 84.882393 63.541468 55.741646 57.894564 56.732118 33.142979    10
   f_chinsoon12(lst3) 31.984571 24.320220 22.148787 22.393368 23.599284 15.211029    10
       f_GKi_6a(lst3)  7.207269  5.978577  5.431342  5.761809  5.852944  3.439283    10
       f_GKi_6b(lst3)  7.399280  5.751190  6.350720  5.484894  5.893290  8.035091    10
    f_Rcpp_Hash(lst3)  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    10


microbenchmark(check = 'equal', times = 10
               , unit = "relative"
               , f_ThomsIsCoding(lst)
               , f_chinsoon12(lst)
               , f_GKi_6a(lst)
               , f_GKi_6b(lst)
               , f_Rcpp_Hash(lst))
Unit: relative
                expr        min         lq       mean     median        uq       max neval
f_ThomsIsCoding(lst) 199.776328 202.318938 142.909407 209.422530 91.753335 85.090838    10
   f_chinsoon12(lst)   9.542780   8.983248   6.755171   9.766027  4.903246  3.834358    10
       f_GKi_6a(lst)   3.169508   3.158366   2.555443   3.731292  1.902140  1.649982    10
       f_GKi_6b(lst)   2.992992   2.943981   2.019393   3.046393  1.315166  1.069585    10
    f_Rcpp_Hash(lst)   1.000000   1.000000   1.000000   1.000000  1.000000  1.000000    10

About 3x faster than the fastest solution yet on the larger example.

What does this mean?

To me, this result speaks volumes to the beauty and efficiency of base R as displayed by @GKi, @chinsoon12, @Gregor, @ThomasIsCoding, and more. We wrote around 100 lines of very specific C++ to obtain a moderate speed up. To be fair, the base R solutions end up calling mostly compiled code and end up utilizing hash tables as we did above.

answered Nov 09 '22 09:11

Joseph Wood

After sorting you can use duplicated and all.

s <- lapply(lst, function(x) lapply(x, sort)) #Sort vectors
s <- lapply(s, function(x) x[order(vapply(x, "[", 1, 1))]) #Sort lists
all(duplicated(s)[-1]) #Test if there are all identical
#length(unique(s)) == 1 #Alternative way to test if all are identical

Alternative: Sort in one loop

s <- lapply(lst, function(x) {
  tt <- lapply(x, sort)
  tt[order(vapply(tt, "[", 1, 1))]
})
all(duplicated(s)[-1])

Alternative: Sort during loop and allow early exit

s <- lapply(lst[[1]], sort)
s <- s[order(vapply(s, "[", 1, 1))]
tt  <- TRUE
for(i in seq(lst)[-1]) {
  x <- lapply(lst[[i]], sort)
  x <- x[order(vapply(x, "[", 1, 1))]
  if(!identical(s, x)) {
    tt  <- FALSE
    break;
  }
}
tt

or using setequal

s <- lapply(lst[[1]], sort)
tt  <- TRUE
for(i in seq(lst)[-1]) {
  x <- lapply(lst[[i]], sort)
  if(!setequal(s, x)) {
    tt  <- FALSE
    break;
  }
}
tt

or improving slightly the idea from @chinsoon12 to exchange the list with a vector!

s <- lst[[1]][order(vapply(lst[[1]], min, 1))]
s <- rep(seq_along(s), lengths(s))[order(unlist(s))]
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]][order(vapply(lst[[i]], min, 1))]
  x <- rep(seq_along(x), lengths(x))[order(unlist(x))]
  if(!identical(s, x)) {tt <- FALSE; break;}
}
tt

or avoid the second order

s <- lst[[1]][order(vapply(lst[[1]], min, 1))]
s <- rep(seq_along(s), lengths(s))[order(unlist(s))]
y <- s
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]][order(vapply(lst[[i]], min, 1))]
  y <- y[0]
  y[unlist(x)] <- rep(seq_along(x), lengths(x))
  if(!identical(s, y)) {tt <- FALSE; break;}
}
tt

or exchange order with match (or fmatch)

x <- lst[[1]]
s <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
s <- match(s, unique(s))
tt <- TRUE
for(i in seq(lst)[-1]) {
  x <- lst[[i]]
  y <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
  y <- match(y, unique(y))
  if(!identical(s, y)) {tt <- FALSE; break;}
}
tt

Or without early exit.

s <- lapply(lst, function(x) {
  y <- "[<-"(integer(),unlist(x),rep(seq_along(x), lengths(x)))
  match(y, unique(y))
})
all(duplicated(s)[-1])

or written in C++

sourceCpp(code = "#include <Rcpp.h>
#include <vector>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
bool f_GKi_6_Rcpp(const List &x) {
  const List &x0 = x[0];
  const unsigned int n = x0.length();
  unsigned int nn = 0;
  for (List const &i : x0) {nn += i.length();}
  std::vector<int> s(nn);
  for (unsigned int i=0; i<n; ++i) {
    const IntegerVector &v = x0[i];
    for (int const &j : v) {
      if(j > nn) return false;
      s[j-1] = i;
    }
  }
  {
    std::vector<int> lup(n, -1);
    int j = 0;
    for(int &i : s) {
      if(lup[i] < 0) {lup[i] = j++;}
      i = lup[i];
    }
  }
  for (List const &i : x) {
    if(i.length() != n) return false;
    std::vector<int> sx(nn);
    for(unsigned int j=0; j<n; ++j) {
      const IntegerVector &v = i[j];
      for (int const &k : v) {
        if(k > nn) return false;
        sx[k-1] = j;
      }
    }
    {
      std::vector<int> lup(n, -1);
      int j = 0;
      for(int &i : sx) {
        int &lupp = lup[i];
        if(lupp == -1) {lupp = j; i = j++;
        } else {i = lupp;}
      }
    }
    if(s!=sx) return false;
  }
  return true;
}
")

Thanks to @Gregor for hints to improve the answer!

answered Nov 09 '22 09:11

GKi

Related questions
                            
                                Rcpp How to convert IntegerVector to NumericVector
                            
                                R knitr Add linebreak in table header kable()
                            
                                R generate all possible interaction variables
                            
                                geom_text with facet_wrap in ggplot2 when group specified
                            
                                Using a reactive expression in an if statement in shiny
                            
                                Dplyr - Filter if any variable is equal to a value
                            
                                ggplot, ggplotly, scale_y_continuous, ylim and percentage
                            
                                Adding value after every nth element of vector in R
                            
                                Group by one column, select row with minimum in one column for every pair of columns
                            
                                Why is bam from mgcv slow for some data?
                            
                                Decrease margins between plots when using cowplot
                            
                                Installing R on Linux: configure: WARNING: you cannot build PDF versions of the R manuals
                            
                                How to correctly convert NaN to NA
                            
                                using tidyr unnest with NULL values
                            
                                Find column number that satisfies condition
                            
                                curl package not available for several R packages
                            
                                Change legend title ggplot2 [duplicate]
                            
                                How to subset dataframe based on a "not equal to" criteria applied to a large number of columns?
                            
                                How to know the operations made to calculate the Levenshtein distance between strings?
                            
                                Creating a new column conditionally based on previous n rows

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Any faster way to check if lists in a list are equivalent?

Tags:

performance

algorithm

list

r

if-statement