Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Testing if rows of a matrix or data frame are sorted in R

Tags:

sorting

r

matlab

What is an efficient way to test if rows in a matrix are sorted? [Update: see Aaron's Rcpp answer - straightforward & very fast.]

I am porting some code that uses issorted(,'rows') from Matlab. As it seems that is.unsorted does not extend beyond vectors, I'm writing or looking for something else. The naive method is to check that the sorted version of the matrix (or data frame) is the same as the original, but that's obviously inefficient.

NB: For sorting, a la sortrows() in Matlab, my code essentially uses SortedDF <- DF[do.call(order, DF),] (it's wrapped in a larger function that converts matrices to data frames, passes parameters to order, etc.). I wouldn't be surprised if there are faster implementations (data table comes to mind).


Update 1: To clarify: I'm not testing for sorting intra-row or intra-columns. (Such sorting generally results in an algebraically different matrix.)

As an example for creating an unsorted matrix:

set.seed(0)
x <- as.data.frame(matrix(sample(3, 60, replace = TRUE), ncol = 6, byrow = TRUE))

Its sorted version is:

y <- x[do.call(order, x),]

A proper test, say testSorted, would return FALSE for testSorted(x) and TRUE for testSorted(y).

Update 2: The answers below are all good - they are concise and do the test. Regarding efficiency, it looks like these are sorting the data after all.

I've tried these with rather large matrices, such as 1M x 10, (just changing the creation of x above) and all have about the same time and memory cost. What's peculiar is that they all consume more time for unsorted objects (about 5.5 seconds for 1Mx10) than for sorted ones (about 0.5 seconds for y). This suggests they're sorting before testing.

I tested by creating a z matrix:

z <- y
z[,2] <- y[,1]
z[,1] <- y[,2]

In this case, all of the methods take about 0.85 seconds to complete. Anyway, finishing in 5.5 seconds isn't terrible (in fact, that seems to be right about the time necessary to sort the object), but knowing that a sorted matrix is 11X faster suggests that a test that doesn't sort could be even faster. In the case of the 1M row matrix, the first three rows of x are:

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1  3  1  2  2  3  1  3  3  2   2
2  1  1  1  3  2  3  2  3  3   2
3  3  3  1  2  1  1  2  1  2   3

There's no need to look beyond row 2, though vectorization isn't a bad idea.

(I've also added the byrow argument for the creation of x, so that row values don't depend on the size of x.)

Update 3: Another comparison for this testing can be found with the sort -c command in Linux. If the file is already written (using write.table()), with 1M rows, then time sort -c myfile.txt takes 0.003 seconds for the unsorted data and 0.101 seconds for the sorted data. I don't intend to write out to a file, but it's a useful comparison.

Update 4: Aaron's Rcpp method bested all other methods offered here and that I've tried (including the sort -c comparison above: in-memory is expected to beat on-disk). As for the ratio relative to other methods, it's hard to tell: the denominator is too small to give an accurate measurement, and I've not extensively explored microbenchmark. The speedups can be very large (4-5 orders of magnitude) for some matrices (e.g. one made with rnorm), but this is misleading - checking can terminate after only a couple of rows. I've had speedups with the example matrices of about 25-60 for the unsorted and about 1.1X for the sorted, as the competing methods were already very fast if the data is sorted.

Since this does the right thing (i.e. no sorting, just testing), and does it very quickly, it's the accepted answer.

like image 853
Iterator Avatar asked Sep 29 '11 14:09

Iterator


People also ask

How do I check if a list is sorted in R?

unsorted() function in R Language is used to check if an object is sorted or not. It returns False if the object is sorted otherwise True.

How do I sort a row in a matrix in R?

To sort each row of an R data frame in increasing order, we can use apply function for sorting the columns and then transpose the output. For example, if we have a data frame called df that contains 5 columns then each row of df can be sorted in increasing order by using the command t(apply(df,1,sort)).

How do I select specific rows with conditions in R?

By Using subset() R base also provides a subset() function that can be used to select rows based on the logical condition of a column.


2 Answers

If y is sorted then do.call(order,y) returns 1:nrow(y).

 testSorted = function(y){all(do.call(order,y)==1:nrow(y))}

note this doesn't compare the matrices, but it doesn't dip out as soon as it finds a non-match.

like image 82
Spacedman Avatar answered Oct 29 '22 11:10

Spacedman


Well, why don't you use:

all(do.call(order, y)==seq(nrow(y)))

That avoids creating the ordered matrix, and ensures it checks your style of ordering.

like image 25
Nick Sabbe Avatar answered Oct 29 '22 11:10

Nick Sabbe