
R: agrep results quantifier

Tags: r, agrep

Is there a built-in way to quantify the results of the agrep function? For example, in

agrep("test", c("tesr", "teqr", "toar"), max.distance = 2, value = TRUE)
[1] "tesr" "teqr"

tesr is only one character edit away from test, while teqr is two edits away and toar is three, so toar is not found. In that sense tesr is a more "probable" match than teqr. How can this be retrieved, either as a number of edits or as a percentage? Thanks!

Edit: Apologies for not putting this in the question in the first place. I am already running a two-step procedure: agrep to get my list, and then adist to get the number of edits. adist is slower, and running time is a big factor for my dataset.
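For reference, the two-step procedure described in the edit might look roughly like the following. This is only a sketch in base R: the variable names `hits` and `d` are illustrative and not taken from the original post.

```r
# Candidate strings from the question
s <- c("tesr", "teqr", "toar")

# Step 1: agrep() narrows the list down to approximate matches
hits <- agrep("test", s, max.distance = 2, value = TRUE)

# Step 2: adist() is run only on the survivors;
# as.vector() flattens the 1 x n distance matrix
d <- as.vector(adist("test", hits))
names(d) <- hits
d
```

Here the expensive call (adist) only sees the strings that agrep already accepted, which is where the running-time concern comes from when the list of hits is large.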

Alexey Ferapontov asked Mar 14 '23 08:03


2 Answers

Another option using adist():

s <- c("tesr", "teqr", "toar")
s[adist("test", s) < 3]

Or using stringdist:

library(stringdist)
s[stringdist("test", s, method = "lv") < 3]

Which gives:

#[1] "tesr" "teqr"
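The filter above drops the distances themselves, but the question asks for them. A minimal sketch of keeping the edit count and turning it into a percentage-style score follows. The similarity formula (one minus the distance divided by the length of the longer string) is an assumption chosen for illustration, not something defined by adist or agrep.

```r
s <- c("tesr", "teqr", "toar")
pattern <- "test"

# Number of edits between the pattern and each candidate
d <- as.vector(adist(pattern, s))

# Assumed similarity score: 1 - distance / length of the longer string
sim <- 1 - d / pmax(nchar(pattern), nchar(s))

# One row per candidate, with both the edit count and the score
res <- data.frame(string = s, distance = d, similarity = sim)
res[res$distance < 3, ]
```

For these inputs the distances are 1, 2 and 3, so the scores come out as 0.75, 0.50 and 0.25, and the last row is filtered away.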

Benchmark

x <- rep(s, 10e5)        # 10e5 = 1e6 repeats, i.e. 3 million strings
library(microbenchmark)
library(RecordLinkage)   # provides levenshteinDist()
mbm <- microbenchmark(
  levenshteinDist = x[which(levenshteinDist("test", x) < 3)],
  adist = x[adist("test", x) < 3],
  stringdist = x[stringdist("test", x, method = "lv") < 3],
  times = 10
)
mbm

Which gives:

Unit: milliseconds
            expr       min        lq      mean    median        uq       max neval cld
 levenshteinDist  840.7897 1255.1183 1406.8887 1398.4502 1510.5398 1960.4730    10  b 
           adist 2760.7677 2905.5958 2993.9021 2986.1997 3038.7692 3472.7767    10   c
      stringdist  145.8252  155.3228  210.4206  174.5924  294.8686  355.1552    10 a  
Steven Beaupré answered Mar 20 '23 03:03


The Levenshtein distance is the number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into another. The 'RecordLinkage' package may be of interest: it provides the edit distance computation below, which should perform on par with agrep, although it will not return the same results as agrep.

library(RecordLinkage)
ld <- levenshteinDist("test", c("tesr", "teqr", "toar"))
c("tesr", "teqr", "toar")[which(ld < 3)]
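One likely reason the results differ from agrep is that agrep looks for an approximate match of the pattern inside each element, while a plain Levenshtein distance compares whole strings. The base R sketch below illustrates this. The example string is made up for illustration, and `partial = TRUE` is the adist argument documented as corresponding to the distance agrep uses.

```r
pattern <- "test"
y <- "a testing string"   # made-up example, not from the original post

# agrep() finds the pattern approximately *inside* y
hit <- agrep(pattern, y, max.distance = 1, value = TRUE)

# Whole-string Levenshtein distance: every extra character counts as an edit
full <- as.vector(adist(pattern, y))

# Substring-style distance, closer to what agrep() measures
part <- as.vector(adist(pattern, y, partial = TRUE))

c(full = full, partial = part)
```

If the goal is to attach a number to each agrep hit, the partial distance is probably the more faithful one, though it may be slower than the whole-string computation.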
vpipkt answered Mar 20 '23 04:03