 

How to efficiently check if a matrix is in binary form (e.g. all 1's or 0's)?

Tags: r, matrix

I have a function that takes an m x n (potentially) binary matrix as input, and I would like it to raise an error if the matrix contains a value that is not 0 or 1, or that is NA. How can I check this efficiently?

For instance, generating some data for a 10 x 10 matrix:

> n=10;m=10
> mat = round(matrix(runif(m*n), m, n))
> mat
        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    1    0    1    1    0    1    0    1     0
 [2,]    0    0    0    0    0    0    0    0    0     1
 [3,]    1    1    0    1    1    0    0    1    1     0
 [4,]    1    1    1    1    0    1    0    0    1     1
 [5,]    1    1    1    0    0    1    1    1    0     1
 [6,]    1    0    1    0    0    0    0    1    0     0
 [7,]    0    0    0    1    0    1    1    1    1     0
 [8,]    0    0    0    1    0    1    1    1    1     1
 [9,]    0    0    1    1    0    1    1    1    1     1
[10,]    1    0    1    1    0    0    0    0    1     1

should always return that the matrix is binary, but changing it in one of the following ways:

> mat[1,1]=NA
> mat[1,1]=2

should return that the matrix is not binary.

Currently, I have been using this in my function:

for(i in 1:nrow(mat))
{
    for(j in 1:ncol(mat))
    {
      if(is.na(mat[i,j])|(!(mat[i,j] == 1 | mat[i,j] == 0)))
      {
        stop("Data must be only 0s, 1s")
      }
    }
}

but it seems awfully slow and inefficient to individually check every value for large matrices. Is there a clever, easy way to do this I'm missing?

Thanks

Asked by Csislander on Apr 24 '14

5 Answers

I immediately thought of identical(mat, matrix(as.numeric(as.logical(mat)), nrow = nrow(mat)))

This leaves NA as NA so if you want to identify the existence of such, you'll just need a quick any(is.na(mat)) or similar test.
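Putting the two checks together, a minimal sketch might look like this (the helper name is mine, not from the answer; it assumes mat is stored as a double ("numeric") matrix, because identical() is strict about type):

## Sketch only: combine the identical() trick with an explicit NA test.
## Assumes a double matrix; an integer matrix would fail the
## identical() comparison on storage type alone.
is_binary <- function(mat) {
  !anyNA(mat) &&
    identical(as.vector(mat), as.numeric(as.logical(mat)))
}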

EDIT: time trial

fun2 <- function(x) {
      all(x %in% 0:1)
}
fun1 <- function(x) { identical(as.vector(x), as.numeric(as.logical(x))) }

library(microbenchmark)  # is.binary.sum2() is defined in sgibb's answer below
mfoo <- matrix(sample(0:10, 1e6, rep = TRUE), 1e3)
microbenchmark(fun1(mfoo), fun2(mfoo), is.binary.sum2(mfoo), times = 10)
Unit: milliseconds
                 expr       min        lq    median        uq       max neval
           fun1(mfoo)  2.286941  2.809926  2.835584  2.865518  2.920253    10
           fun2(mfoo) 20.369075 20.894627 21.100528 21.226464 21.407777    10
 is.binary.sum2(mfoo) 11.394503 12.418238 12.431922 12.458436 28.316492    10

And against the stopifnot() approach: I had to throw in a try() to avoid breaking the test.

notfun <- function(mat) try(stopifnot(sum(mat==0) + sum(mat==1) == length(mat)))
microbenchmark(fun1(mfoo), notfun(mfoo), is.binary.sum2(mfoo), times = 10)
Error : sum(mat == 0) + sum(mat == 1) == length(mat) is not TRUE
##error repeated 10x for the 10 trials
Unit: milliseconds
                 expr       min        lq    median        uq       max neval
           fun1(mfoo)  4.870653  4.978414  5.057524  5.268344  5.438111    10
         notfun(mfoo) 18.149273 18.685942 18.942518 19.241856 34.826230    10
 is.binary.sum2(mfoo) 11.428713 12.145842 12.516165 12.605111 13.090465    10

I win! :-)

Answered by Carl Witthoft on Oct 07 '22

Here are timings for a few options (including options suggested in other answers):

n=5000;m=5000
mat = round(matrix(runif(m*n), m, n))
> system.time(stopifnot(sum(mat==0) + sum(mat==1) == length(mat)))
   user  system elapsed 
   0.30    0.02    0.31 
> system.time(stopifnot(all(mat %in% c(0,1))))
   user  system elapsed 
   0.58    0.06    0.63 
> system.time(stopifnot(all(mat==0 | mat==1)))
   user  system elapsed 
   0.77    0.03    0.80 

They're all pretty fast, considering it's a 5000 by 5000 matrix! The fastest of the three seems to be:

stopifnot(sum(mat==0) + sum(mat==1) == length(mat))
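If this goes inside the function from the question, one possible wrapper (my own sketch, not from the answer) uses isTRUE() so that an NA-containing matrix, whose sums become NA, is rejected with the same message:

## Sketch only: with NAs present, mat == 0 yields NA, the sums become NA
## and the comparison is NA; !isTRUE(NA) is TRUE, so stop() still fires.
check_binary <- function(mat) {
  if (!isTRUE(sum(mat == 0) + sum(mat == 1) == length(mat))) {
    stop("Data must be only 0s, 1s")
  }
  invisible(TRUE)
}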
Answered by James Trimble on Oct 07 '22


I'd like to add a slightly modified version of the sum-based comparison that is faster than @JamesTrimble's version. I hope all my assumptions are correct:

is.binary.sum2 <- function(x) {
  identical(sum(abs(x)) - sum(x == 1), 0)
}
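The reason this works: entries equal to 1 contribute 1 to both sums and cancel, so the difference is the sum of |x| over the entries that are not exactly 1, which is 0 only when all of those entries are 0. A quick illustration (my own, not part of the answer):

x_ok  <- c(0, 1, 1, 0)
x_bad <- c(0, 1, 2, 0)
sum(abs(x_ok))  - sum(x_ok == 1)   # 0 -> binary
sum(abs(x_bad)) - sum(x_bad == 1)  # 2 -> not binary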

Here is the benchmark:

library(rbenchmark)

n=5000
m=5000
mat = round(matrix(runif(m*n), m, n))

is.binary.sum <- function(x) {
  sum(x == 0) + sum(x == 1) == length(x)
}

is.binary.sum2 <- function(x) {
  identical(sum(abs(x)) - sum(x == 1), 0)
}

is.binary.all <- function(x) {
  all(x == 0 | x == 1)
}

is.binary.in <- function(x) {
  all(x %in% c(0, 1))
}

benchmark(is.binary.sum(mat), is.binary.sum2(mat),
          is.binary.all(mat), is.binary.in(mat),
          order="relative", replications=10)
#                 test replications elapsed relative user.self sys.self user.child sys.child
#2 is.binary.sum2(mat)           10   4.635    1.000     3.872    0.744          0         0
#1  is.binary.sum(mat)           10   7.097    1.531     6.565    0.512          0         0
#4   is.binary.in(mat)           10  10.359    2.235     9.216    1.108          0         0
#3  is.binary.all(mat)           10  12.565    2.711    11.753    0.772          0         0
Answered by sgibb on Oct 07 '22


A quite efficient (and readable) way could be

all(mat %in% c(0,1))
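One handy property (my own illustration, not from the answer): %in% is defined via match() with nomatch = 0, so it never returns NA, and missing values are caught by the same test:

NA %in% c(0, 1)                  # FALSE
all(c(0, 1, NA) %in% c(0, 1))    # FALSE -> correctly reported as not binary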

However, as pointed out, it may not be the fastest compared to the other solutions.

But, to add to this: if efficiency is a must (e.g. you run this test many times), there is a lot to gain from working with an integer matrix (doubles take more bytes) and checking against integer values. This gain could apply to the other solutions as well. A few tests with %in% follow:

library(microbenchmark)
set.seed(1)

my.dim <- 1e04
n <- my.dim
m <- my.dim
mat <- round(matrix(runif(m*n), m, n))
int.mat <- as.integer(mat)

fun1 <- function(x) {
      all(x %in% c(0,1))
}
fun2 <- function(x) {
      all(x %in% 0:1)
}

## why?
storage.mode(0:1)
## [1] "integer"
storage.mode(c(0,1))
## [1] "double"
object.size(0:1)
## 48 bytes
object.size(c(0,1))
## 56 bytes
## and considering mat and int.mat
object.size(mat)
## 800000200 bytes
object.size(int.mat)
## 400000040 bytes

(res <- microbenchmark(fun1(mat), fun2(int.mat), times = 10, unit = "s"))
## Unit: seconds
##           expr     min      lq  median      uq     max neval
##      fun1(mat) 3.68843 3.69325 3.70433 3.72627 3.73041    10
##  fun2(int.mat) 1.28956 1.29157 1.32934 1.34370 1.35718    10

From 3.70 to 1.32 is not that bad :)
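One small caveat on the conversion (my own aside, not from the answer): as.integer() returns a plain vector and drops the dim attribute, which %in% does not mind but which loses the matrix shape. storage.mode<- converts in place and keeps the dimensions:

## Keeps the matrix shape while halving the memory footprint.
int.mat2 <- mat
storage.mode(int.mat2) <- "integer"
is.matrix(int.mat2)       # TRUE
all(int.mat2 %in% 0:1)    # same check on integer storage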

Answered by Luca Braglia on Oct 07 '22


Note: I changed a few things so it runs in Octave, but it should be pretty similar to MATLAB.

Generate the matrix:

n=5000;m=5000
mat=randi([0,1],n,m);

Now we do something simple: 1*2-1 leaves 1 equal to 1, while it maps 0 to -1, so abs makes both equal to 1. For any other value, say -1, we get -1*2-1 = -3, so this isn't the case. We then subtract 1 and should be left with a matrix of only zeros. This is easily checked in MATLAB/Octave with any:

any(any(abs(mat*2-1)-1));

Checking its speed:

mat=randi([0,1],n,m);
[t0 u0 s0]=cputime(); any(any(abs(mat+mat-1)-1)); [t1 u1 s1]=cputime(); [t1-t0 u1-u0 s1-s0]
ans =
 0.176772   0.127546   0.049226

In order: total, user, and system time.

Pretty decent: under 0.18 seconds, with most of it in user mode. With 10,000 x 10,000 entries it is still under a second, clocking in at 0.86 seconds on my system.

Oh, heck, I only now see that the question actually asks for R, not MATLAB. I hope someone likes the comparison though.

Handling NaN values is easy in Octave/MATLAB with isnan(mat), possibly in the form any(any(isnan(mat))) if you like. This includes NA values. Handling only NA values is done with isna(mat).
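For completeness, a hedged R translation of the same arithmetic trick (my own, not part of the answer):

## 2*x - 1 maps 0 -> -1 and 1 -> 1, so abs() is exactly 1 for binary
## entries and something else otherwise. NAs propagate through the
## arithmetic, and isTRUE() then treats the resulting NA as "not binary".
is_binary_arith <- function(mat) {
  isTRUE(all(abs(2 * mat - 1) == 1))
}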

Answered by Anne van Rossum on Oct 07 '22