Is there any easy way to order a DataFrame by two (or more or one) of its columns within RCpp?
There are many sorting algorithms available on the net, or I can use std::sort
with a wrapper for DataFrame, but I was wondering if there is something already available within either RCpp or RCppArmadillo?
I need to do this sorting / ordering as a part of another function
DataFrame myFunc(DataFrame myDF, NumericVector x) {
//// some code here
DataFrame myDFsorted = sort (myDF, someColName1, someColName2) // how to sort??
//// some code here
}
I would like to avoid accessing R's order
function within RCpp (for retaining speed of the RCpp code).
Many thanks
To sort a data frame in R, use the order( ) function. By default, sorting is ASCENDING. Prepend the sorting variable by a minus sign to indicate DESCENDING order.
Method 1: User order() from base R Here order() function is used to sort the dataframe by R using order() function based on the date column, we have to convert the date column to date with the format, this will sort in ascending order.
The difficulty is that a data frame is a set of vectors, potentially of different types; We need a way to order them independently of these types (integer, character, ...). In dplyr, we have developed what we call vector visitors. For this particular problem, what we need is a set of OrderVisitor
, which exhibit the following interface:
class OrderVisitor {
public:
virtual ~OrderVisitor(){}
/** are the elements at indices i and j equal */
virtual bool equal(int i, int j) const = 0 ;
/** is the i element less than the j element */
virtual bool before( int i, int j) const = 0 ;
virtual SEXP get() = 0 ;
} ;
dplyr then has implementations of OrderVisitor
for all types we are supporting in this file and we have a dispatcher function order_visitor
that makes an OrderVisitor*
from a vector.
With this, we can store a set of vector visitors into a std::vector<OrderVisitor*>
; The OrderVisitors has a constructor taking a DataFrame
and a CharacterVector
of names of vectors we want to use for the ordering.
OrderVisitors o(data, names ) ;
Then we can use the OrderVisitors.apply
method which essentially does lexicographic ordering:
IntegerVector index = o.apply() ;
The apply
method is implemented by simply initializing an IntegerVector
with 0..n
and then std::sort
it according to the visitors.
inline Rcpp::IntegerVector OrderVisitors::apply() const {
IntegerVector x = seq(0, nrows -1 ) ;
std::sort( x.begin(), x.end(), OrderVisitors_Compare(*this) ) ;
return x ;
}
The relevant thing here is how the OrderVisitors_Compare
class implements operator()(int,int)
:
inline bool operator()(int i, int j) const {
if( i == j ) return false ;
for( int k=0; k<n; k++)
if( ! obj.visitors[k]->equal(i,j) )
return obj.visitors[k]->before(i, j ) ;
return i < j ;
}
So at this point index
gives us the integer indices of the sorted data, we just have to make a new DataFrame
from data
by subsetting data
with these indices. For this we have another kind of visitors, encapsulated in the DataFrameVisitors
class. We first create a DataFrameVisitors
:
DataFrameVisitors visitors( data ) ;
This encapsulates a std::vector<VectorVisitor*>
. Each of these VectorVisitor*
knows how to subset itself with an integer vector index. This is used from DataFrameVisitors.subset
:
template <typename Container>
DataFrame subset( const Container& index, const CharacterVector& classes ) const {
List out(nvisitors);
for( int k=0; k<nvisitors; k++){
out[k] = get(k)->subset(index) ;
}
structure( out, Rf_length(out[0]) , classes) ;
return (SEXP)out ;
}
To wrap this up, here is a simple function using tools developped in dplyr:
#include <dplyr.h>
// [[Rcpp::depends(dplyr)]]
using namespace Rcpp ;
using namespace dplyr ;
// [[Rcpp::export]]
DataFrame myFunc(DataFrame data, CharacterVector names) {
OrderVisitors o(data, names ) ;
IntegerVector index = o.apply() ;
DataFrameVisitors visitors( data ) ;
DataFrame res = visitors.subset(index, "data.frame" ) ;
return res ;
}
Because a data.frame
is really a list of columns at the C++, you would have to re-order all your columns individually given a new ording index. This is different from how [.., ..]
indexing works in R for a data.frame
.
See e.g. this Rcpp Gallery article on sorting vectors for some pointers. You will probably have to supply the new ordering index to be used, after which it is just an indexing question -- and that too has some posts on the Gallery.
This SO post may get you started on the index creation; this bytes.com post discusses the same idea.
Edit: And Armadillo has function sort_index()
and stable_sort_index()
to create the index you need for re-arranging your columns. This only covers the one column case, and is limited to numerical columns, but is a start.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With