Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

find indices of values within tolerance range in R

Tags:

r

say I have vector x

x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
tol <- 0.4

how do I get the indices of the groups of elements that are 'unique' within the tolerance range (tol) as in the list below. I don't know how many of these groups there are beforehand.

[[1]]
[1] 1 2 3 5

[[2]]
[1] 4 6

[[3]]
[1] 7

thanks

like image 670
Lukas Avatar asked Sep 14 '16 09:09

Lukas


2 Answers

Not 100% reliable, since it uses unique on lists, but you can try:

unique(apply(outer(x,x,function(a,b) abs(a-b)<tol),1,which))
#[[1]]
#[1] 1 2 3 5
#
#[[2]]
#[1] 4 6
#
#[[3]]
#[1] 7

The point @Roland raised in the comments showed that there is some ambiguity in your requirements. For instance if x<-c(1, 1.3, 1.6), my line gives three groups: 1-2, 2-3 and 1-2-3. This because, from the 1 point of view, it is similar only to 1.3, but from 1.3 point of view, it is similar to both 1 and 1.6.

like image 83
nicola Avatar answered Oct 20 '22 18:10

nicola


An alternative using nn2 from RANN to find nearest neighbors within radius for clustering:

library(RANN)

x <- c(1, 1, 1.1, 2, 1, 2.1, 2.6)
tol=0.4
nn <- nn2(x,x,k=length(x),searchtype="radius",radius=tol)
m <- unique(apply(nn$nn.idx,1,sort), MARGIN=2)
sapply(seq_len(ncol(m)), function(i) m[which(m[,i] > 0),i])
##[[1]]
##[1] 1 2 3 5
##
##[[2]]
##[1] 4 6
##
##[[3]]
##[1] 7

x <- c(1, 1.3, 1.6)
nn <- nn2(x,x,k=length(x),searchtype="radius",radius=tol)
m <- unique(apply(nn$nn.idx,1,sort), MARGIN=2)  
sapply(seq_len(ncol(m)), function(i) m[which(m[,i] > 0),i])
##[[1]]
##[1] 1 2
##
##[[2]]
##[1] 1 2 3
##
##[[3]]
##[1] 2 3

Notes:

  1. The call to nn2 finds all nearest neighbors for each element of x with respect to all elements of x within a radius equalling the tol. The result nn$nn.idx is a matrix whose rows contain the indices that are nearest neighbors for each element in x. The matrix is dense and filled with zeroes as needed.
  2. Clustering is performed by sorting each row so that unique rows can be extracted. The output m is a matrix where each column contains the indices in a cluster. Again, this matrix is dense and filled with zeroes as needed.
  3. The resulting list is extracted by subsetting each column to remove the zero entries.

This is likely more efficient for large x because nn2 uses a KD-Tree, but it suffers from the same issue for elements that overlap (with respect to the tolerance) as pointed out by nicola.

like image 43
aichao Avatar answered Oct 20 '22 18:10

aichao