Rewriting slow R function in C++ & Rcpp

I have this line of R code:

croppedDNA <- completeDNA[,apply(completeDNA,2,function(x) any(c(FALSE,x[-length(x)]!=x[-1])))]

It identifies the sites (columns) in a matrix of DNA sequences (one row = one sequence) that are informative, i.e. not identical across all sequences, and subsets the matrix to those columns to make a new 'cropped' matrix. In other words, it drops every column in which all values are the same. For a big dataset this takes about 6 seconds. I don't know if I can do it faster in C++ (I'm still a beginner in C++), but it will be good for me to try. My idea is to use Rcpp: loop through the columns of the CharacterMatrix, pull out each column (the site) as a CharacterVector, and check whether its values are all the same. If they are not, record that column's index, and continue for all columns. Then at the end, make a new CharacterMatrix that includes only those columns. It is important that I keep the row names and column names as they are in the "R version" of the matrix, i.e. if a column goes, so should its colname.

I've been writing for about two minutes, so far what I have is (not finished):

#include <Rcpp.h>
#include <vector>
using namespace Rcpp;
// [[Rcpp::export]]
CharacterMatrix reduce_sequences(CharacterMatrix completeDNA)
{
  std::vector<bool> informativeSites; 
  for(int i = 0; i < completeDNA.ncol(); i++)
  {
    CharacterVector bpsite = completeDNA(,i);
    if(all(bpsite == bpsite[1])
    {
      informativeSites.push_back(i);
    }
  }
  CharacterMatrix cutDNA = completeDNA(,informativeSites);
  return cutDNA;
}

Am I going the right way about this? Is there an easier way? My understanding is that I need std::vector because it is easy to grow (I don't know in advance how many columns I am going to keep). With the indexing, will I need to add 1 to each entry of the informativeSites vector at the end (because R indexes from 1 and C++ from 0)?

Thanks, Ben W.

asked May 15 '13 02:05

Ward9250

1 Answer

Sample data:

set.seed(123)
z <- matrix(sample(c("a", "t", "c", "g", "N", "-"), 3*398508, TRUE), 3, 398508)

OP's solution:

system.time(y1 <- z[,apply(z,2,function(x) any(c(FALSE,x[-length(x)]!=x[-1])))])
#    user  system elapsed 
#   4.929   0.043   4.976

A faster version using base R:

system.time(y2 <- (z[, colSums(z[-1,] != z[-nrow(z), ]) > 0]))
#    user  system elapsed 
#   0.087   0.011   0.098

The results are identical:

identical(y1, y2)
# [1] TRUE

It's very possible C++ will beat it, but is it really necessary?
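If you still want to try the C++ route as an exercise, here is a minimal sketch of the column filter in plain C++, without Rcpp, so it compiles on its own. The CharMatrix struct and the informative_columns and subset_columns functions are hypothetical names made up for this illustration; they are not part of Rcpp or any library. The struct only imitates R's column-major storage. Note that the test is the opposite of the draft in the question: a column is recorded when at least one value differs from the first, not when all values are equal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an R character matrix (not an Rcpp type).
// Storage is column-major, as in R: element (r, c) is data[c * nrow + r].
struct CharMatrix {
    std::size_t nrow = 0;
    std::size_t ncol = 0;
    std::vector<std::string> data;
    std::vector<std::string> colnames;  // empty, or one name per column
};

// Return the 0-based indices of columns holding at least two distinct values.
std::vector<std::size_t> informative_columns(const CharMatrix& m) {
    assert(m.data.size() == m.nrow * m.ncol);
    std::vector<std::size_t> keep;  // grows as needed; final size unknown up front
    for (std::size_t c = 0; c < m.ncol; ++c) {
        const std::size_t first = c * m.nrow;
        for (std::size_t r = 1; r < m.nrow; ++r) {
            if (m.data[first + r] != m.data[first]) {
                keep.push_back(c);
                break;  // one mismatch is enough, skip the rest of the column
            }
        }
    }
    return keep;
}

// Copy the kept columns, and their names if present, into a new matrix.
CharMatrix subset_columns(const CharMatrix& m,
                          const std::vector<std::size_t>& keep) {
    CharMatrix out;
    out.nrow = m.nrow;
    out.ncol = keep.size();
    out.data.reserve(out.nrow * out.ncol);
    for (std::size_t c : keep) {
        const auto begin =
            m.data.begin() + static_cast<std::ptrdiff_t>(c * m.nrow);
        out.data.insert(out.data.end(), begin,
                        begin + static_cast<std::ptrdiff_t>(m.nrow));
        if (!m.colnames.empty()) out.colnames.push_back(m.colnames[c]);
    }
    return out;
}
```

The indices stay 0-based for as long as they are used inside C++, so no +1 is needed; that adjustment only matters if the index vector itself is handed back to R. Row names are untouched because no rows are dropped. In an Rcpp version a single column would be accessed as completeDNA(_, j) rather than completeDNA(,j), and the dimnames would have to be copied onto the result explicitly; treat that part as something to check against the Rcpp documentation rather than as tested code.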

answered Sep 17 '22 23:09

flodel

Tags: c++, r, vector, rcpp
