I have a function that take an m x n sized (potentially) binary matrix as input, and I would like to return an error handling if the matrix contains a number that is not 0 or 1, or is NA. How can I check this efficiently?
For instance, by generating some data for a 10 x 10:
> n=10;m=10
> mat = round(matrix(runif(m*n), m, n))
> mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 1 0 1 1 0 1 0 1 0
[2,] 0 0 0 0 0 0 0 0 0 1
[3,] 1 1 0 1 1 0 0 1 1 0
[4,] 1 1 1 1 0 1 0 0 1 1
[5,] 1 1 1 0 0 1 1 1 0 1
[6,] 1 0 1 0 0 0 0 1 0 0
[7,] 0 0 0 1 0 1 1 1 1 0
[8,] 0 0 0 1 0 1 1 1 1 1
[9,] 0 0 1 1 0 1 1 1 1 1
[10,] 1 0 1 1 0 0 0 0 1 1
should always return that the matrix is binary, but changing it in one of the following ways:
> mat[1,1]=NA
> mat[1,1]=2
should return that the matrix is not binary.
Currently, I have been using in my function:
for(i in 1:nrow(mat))
{
for(j in 1:ncol(mat))
{
if(is.na(mat[i,j])|(!(mat[i,j] == 1 | mat[i,j] == 0)))
{
stop("Data must be only 0s, 1s")
}
}
}
but it seems awfully slow and inefficient to individually check every value for large matrices. Is there a clever, easy way to do this I'm missing?
Thanks
I immediately thought of identical(mat,matrix(as.numeric(as.logical(mat),nr=nrow(mat)) ) )
This leaves NA
as NA
so if you want to identify the existence of such, you'll just need a quick any(is.na(mat))
or similar test.
EDIT: time trial
fun2 <- function(x) {
all(x %in% 0:1)
}
fun1 <-function(x) {identical(as.vector(x),as.numeric(as.logical(x)))}
mfoo<-matrix(sample(0:10,1e6,rep=TRUE),1e3)
microbenchmark(fun1(mfoo),fun2(mfoo),is.binary.sum2(mfoo),times=10)
Unit: milliseconds
expr min lq median uq
fun1(mfoo) 2.286941 2.809926 2.835584 2.865518
fun2(mfoo) 20.369075 20.894627 21.100528 21.226464
is.binary.sum2(mfoo) 11.394503 12.418238 12.431922 12.458436
max neval
2.920253 10
21.407777 10
28.316492 10
And against the not...
thing: I had to throw in a try
to avoid breaking the test.
notfun <- function(mat) try(stopifnot(sum(mat==0) + sum(mat==1) == length(mat)))
microbenchmark(fun1(mfoo),notfun(mfoo),is.binary.sum2(mfoo),times=10)
Error : sum(mat == 0) + sum(mat == 1) == length(mat) is not TRUE
##error repeated 10x for the 10 trials
Unit: milliseconds
expr min lq median uq
fun1(mfoo) 4.870653 4.978414 5.057524 5.268344
notfun(mfoo) 18.149273 18.685942 18.942518 19.241856
is.binary.sum2(mfoo) 11.428713 12.145842 12.516165 12.605111
max neval
5.438111 10
34.826230 10
13.090465 10
I win! :-)
Here are timings for a few options (including options suggested in other answers):
n=5000;m=5000
mat = round(matrix(runif(m*n), m, n))
> system.time(stopifnot(sum(mat==0) + sum(mat==1) == length(mat)))
user system elapsed
0.30 0.02 0.31
> system.time(stopifnot(all(mat %in% c(0,1))))
user system elapsed
0.58 0.06 0.63
> system.time(stopifnot(all(mat==0 | mat==1)))
user system elapsed
0.77 0.03 0.80
They're all pretty fast, considering it's a 5000 by 5000 matrix! The fastest of the three seems to be:
stopifnot(sum(mat==0) + sum(mat==1) == length(mat))
I like to add a slightly modified version of the sum
based comparison that is faster than @JamesTrimble's version. I hope all my assumptions are correct:
is.binary.sum2 <- function(x) {
identical(sum(abs(x)) - sum(x == 1), 0)
}
Here the benchmark:
library(rbenchmark)
n=5000
m=5000
mat = round(matrix(runif(m*n), m, n))
is.binary.sum <- function(x) {
sum(x == 0) + sum(x == 1) == length(x)
}
is.binary.sum2 <- function(x) {
identical(sum(abs(x)) - sum(x == 1), 0)
}
is.binary.all <- function(x) {
all(x == 0 | x == 1)
}
is.binary.in <- function(x) {
all(x %in% c(0, 1))
}
benchmark(is.binary.sum(mat), is.binary.sum2(mat),
is.binary.all(mat), is.binary.in(mat),
order="relative", replications=10)
# test replications elapsed relative user.self sys.self user.child sys.child
#2 is.binary.sum2(mat) 10 4.635 1.000 3.872 0.744 0 0
#1 is.binary.sum(mat) 10 7.097 1.531 6.565 0.512 0 0
#4 is.binary.in(mat) 10 10.359 2.235 9.216 1.108 0 0
#3 is.binary.all(mat) 10 12.565 2.711 11.753 0.772 0 0
A quite efficient (and readable) way could be
all(mat %in% c(0,1))
However as pointed out it may not be the most fast, if compared to other solutions.
But, to add a few, if efficiency is a must (eg you do this test a lot
of times) a lot of gain is given by working with integer
matrix
(double
s have more bytes) and check against integer
values.
This gain could also apply to other solutions as well.
A few tests with %in%
follow:
library(microbenchmark)
set.seed(1)
my.dim <- 1e04
n <- my.dim
m <- my.dim
mat <- round(matrix(runif(m*n), m, n))
int.mat <- as.integer(mat)
fun1 <- function(x) {
all(x %in% c(0,1))
}
fun2 <- function(x) {
all(x %in% 0:1)
}
## why?
storage.mode(0:1)
## [1] "integer"
storage.mode(c(0,1))
## [1] "double"
object.size(0:1)
## 48 bytes
object.size(c(0,1))
## 56 bytes
## and considering mat and int.mat
object.size(mat)
## 800000200 bytes
object.size(int.mat)
## 400000040 bytes
(res <- microbenchmark(fun1(mat), fun2(int.mat), times = 10, unit = "s"))
## Unit: seconds
## expr min lq median uq max neval
## fun1(mat) 3.68843 3.69325 3.70433 3.72627 3.73041 10
## fun2(int.mat) 1.28956 1.29157 1.32934 1.34370 1.35718 10
From 3.70 to 1.32 is not that bad :)
Note, I changed a few things so it runs in octave
, but it should be pretty similar to matlab
.
Generate the matrix:
n=5000;m=5000
mat=randi([0,1],n,m);
Now we just do something simple, we know that 1*2-1
would make the 1
equal to 1
, while it makes 0
equal to -1
. So, abs
makes it all the same. For any other value, say -1
, -1*2-1=-3
this isn't the case. Then we subtract 1
and we should be left with a matrix with only zeros. This can be easily checked in matlab/octave with any
:
any(any(abs(mat*2-1)-1));
Checking its speed:
mat=randi([0,1],n,m);
[t0 u0 s0]=cputime(); any(any(abs(mat+mat-1)-1)); [t1 u1 s1]=cputime(); [t1-t0 u1-u0 s1-s0]
ans =
0.176772 0.127546 0.049226
In the order total
, user
, and system
time.
Pretty decent under 0.18
seconds with most of it in user mode. With 10.000 * 10.000
entries it is still under a second, clocking in at 0.86
seconds on my system.
Oh, heck, I only now see it is actually asked for R
, not matlab
. I hope someone likes the comparison though.
Handling NaN
values is easy in octave
/matlab
with isnan(mat)
, eventually in the form of any(any(isnan(mat)))
if you like. This includes NA
values. Handling only NA
values is through isna(mat)
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With