Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find the indices of the top 10,000 elements in a symmetric matrix(12k X 12k) in R

Tags:

r

matrix

I have a nonzero symmetric matrix 'matr' that is 12000X12000. I need to find the indices of the top 10000 elements in 'matr' in R. The code I have written takes a long time - I was wondering if there was any pointers to make it faster.

listk <- numeric(0)
for( i in 1:10000) {
    idx <- which(matr == max(matr), arr.ind=T)
    if( length(idx) != 0) {
        listk <- rbind( listk, idx[1,])
        matr[idx[1,1], idx[1,2]] <- 0
        matr[idx[2,1], idx[2,2]] <- 0
    } 
}
like image 927
user1775614 Avatar asked Feb 11 '13 22:02

user1775614


3 Answers

Here's how you might find the indices (ij) of the 4 largest elements in a 10-by-10 matrix m.

## Sample data
m <- matrix(runif(100), ncol=10)

## Extract the indices of the 4 largest elements
(ij <- which(m >= sort(m, decreasing=T)[4], arr.ind=TRUE))
#      row col
# [1,]   2   1
# [2,]   5   1
# [3,]   6   2
# [4,]   3  10

## Use the indices to extract the values
m[ij]
#  [1] 0.9985190 0.9703268 0.9836373 0.9914510

Edit:

For large matrices, performing a partial sort will be a faster way to find the 10,000th largest element:

v <- runif(1e7)
system.time(a <- sort(v, decreasing=TRUE)[10000])
#    user  system elapsed 
#    4.35    0.03    4.38 
system.time(b <- -sort(-v, partial=10000)[10000])
#    user  system elapsed 
#    0.60    0.09    0.69 
a==b
# [1] TRUE
like image 61
Josh O'Brien Avatar answered Jan 01 '23 20:01

Josh O'Brien


I like @JoshO'Brien 's answer; the use of partial sorting is great! Here's an Rcpp solution (I'm not a strong C++ programmer so probably bone-headed errors; corrections welcome... how would I template this in Rcpp, to handle different types of input vector?)

I start by including the appropriate headers and using namespaces for convenience

#include <Rcpp.h>
#include <queue>

using namespace Rcpp;
using namespace std;

Then arrange to expose my C++ function to R

// [[Rcpp::export]]
IntegerVector top_i_pq(NumericVector v, int n)

and define some variables, most importantly a priority_queue to hold as a pair the numeric value and index. The queue is ordered so the smallest values are at the 'top', with small relying on the standard pair<> comparator.

typedef pair<double, int> Elt;
priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
vector<int> result;

Now I'll walk through the input data, adding it to the queue if either (a) I don't yet have enough values or (b) the current value is larger than the smallest value in the queue. In the latter case, I pop off the smallest value, and insert it's replacement. In this way the priority queue always contains the n_max largest elements.

for (int i = 0; i != v.size(); ++i) {
    if (pq.size() < n)
        pq.push(Elt(v[i], i));
    else {
        Elt elt = Elt(v[i], i);
        if (pq.top() < elt) {
            pq.pop();
            pq.push(elt);
        }
    }
}

And finally I pop the indexes from the priority queue into the return vector, remembering to translate to 1-based R coordinates.

result.reserve(pq.size());
while (!pq.empty()) {
    result.push_back(pq.top().second + 1);
    pq.pop();
}

and return the result to R

return wrap(result);

This has nice memory use (the priority queue and return vector are both small relative to the original data) and is fast

> library(Rcpp); sourceCpp("top_i_pq.cpp"); z <- runif(12000 * 12000)
> system.time(top_i_pq(z, 10000))
   user  system elapsed 
  0.992   0.000   0.998 

Problems with this code include:

  1. The default comparator greater<Elt> works so that, in the case of a tie spanning the value of the _n_th element, the last, rather than first, duplicate is retained.

  2. NA values (and non-finite values?) may not be handled correctly; I'm not sure whether this is true or not.

  3. The function only works for NumericVector input, but the logic is appropriate for any R data type for which an appropriate ordering relationship is defined.

Problems 1 and 2 can likely be dealt with by writing an appropriate comparator; maybe for 2 this is already implemented in Rcpp? I don't know how to leverage C++ language features and the Rcpp design to avoid re-implementing the function for each data type I want to support.

Here's the full code:

#include <Rcpp.h>
#include <queue>

using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
IntegerVector top_i_pq(NumericVector v, int n)
{
    typedef pair<double, int> Elt;
    priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
    vector<int> result;

    for (int i = 0; i != v.size(); ++i) {
        if (pq.size() < n)
            pq.push(Elt(v[i], i));
        else {
            Elt elt = Elt(v[i], i);
            if (pq.top() < elt) {
                pq.pop();
                pq.push(elt);
            }
        }
    }

    result.reserve(pq.size());
    while (!pq.empty()) {
        result.push_back(pq.top().second + 1);
        pq.pop();
    }

    return wrap(result);
}
like image 34
Martin Morgan Avatar answered Jan 01 '23 21:01

Martin Morgan


A bit late into the party, but I came up with this, which avoids the sort.

Say you want the top 10k elements from you 12k x 12k matrix. The idea is to "clip" the data to the elements corresponding to a quantile of that size.

find_n_top_elements <- function( x, n ){

  #set the quantile to correspond to n top elements
  quant <- n / (dim(x)[1]*dim(x)[2])

  #select the cutpoint to get the quantile above quant
  lvl <- quantile(x, probs=1.0-quant)

  #select the elements above the cutpoint
  res <- x[x>lvl[[1]]]
}

#create a 12k x 12k matrix (1,1Gb!)
n <- 12000
x <- matrix( runif(n*n), ncol=n)

system.time( res <- find_n_top_elements( x, 10e3 ) )

Resulting in

system.time( res <- find_n_top_elements( x, 10e3 ) )
 user  system elapsed
 3.47    0.42    3.89 

For comparison, just sorting x on my system takes

system.time(sort(x))
   user  system elapsed 
  30.69    0.21   31.33 
like image 26
k_ride Avatar answered Jan 01 '23 19:01

k_ride