Is there a built-in way to quantify the results of the agrep function? E.g. in
agrep("test", c("tesr", "teqr", "toar"), max.distance = 2, value = TRUE)
[1] "tesr" "teqr"
tesr is only 1 character substitution away from test, while teqr is 2, and toar is 3 and hence not found. So tesr is a closer match (has a higher "probability") than teqr. How can this be retrieved, either as a number of edits or as a percentage?
Thanks!
Edit: Apologies for not putting this in the question in the first place. I am already running a two-step procedure: agrep to get my list, and then adist to get the number of edits. adist is slower, and running time is a big factor for my dataset.
Another option using adist():
s <- c("tesr", "teqr", "toar")
s[adist("test", s) < 3]
Or using stringdist:
library(stringdist)
s[stringdist("test", s, method = "lv") < 3]
Which gives:
#[1] "tesr" "teqr"
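To get the number of edits and a percentage rather than just the filtered strings, the distances can be kept alongside the candidates. The following is a minimal sketch, not a definitive answer: it assumes the stringdist package is installed, and the normalization used for the percentage (distance divided by the length of the longer string) is an assumption, one of several reasonable choices. The names pattern, d, sim and res are illustrative.

```r
library(stringdist)

pattern <- "test"
s <- c("tesr", "teqr", "toar")

# Levenshtein distance from the pattern to every candidate
d <- stringdist(pattern, s, method = "lv")

# Similarity in [0, 1]: 1 minus the distance divided by the longer
# string's length (assumed normalization, chosen for illustration)
sim <- 1 - d / pmax(nchar(pattern), nchar(s))

# Keep the distance and similarity next to each string, then filter
res <- data.frame(string = s, distance = d, similarity = sim)
res[res$distance < 3, ]
```

Because the distance is computed once and reused for both the filter and the score, this avoids a separate agrep pass followed by adist.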
Benchmark
library(RecordLinkage)   # provides levenshteinDist()
library(microbenchmark)
x <- rep(s, 10e5)        # 3 million strings
mbm <- microbenchmark(
  levenshteinDist = x[which(levenshteinDist("test", x) < 3)],
  adist = x[adist("test", x) < 3],
  stringdist = x[stringdist("test", x, method = "lv") < 3],
  times = 10
)
mbm
Which gives:
Unit: milliseconds
            expr       min        lq      mean    median        uq       max neval cld
 levenshteinDist  840.7897 1255.1183 1406.8887 1398.4502 1510.5398 1960.4730    10   b
           adist 2760.7677 2905.5958 2993.9021 2986.1997 3038.7692 3472.7767    10   c
      stringdist  145.8252  155.3228  210.4206  174.5924  294.8686  355.1552    10   a
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into another. The RecordLinkage package may be of interest: it provides the edit distance computation below, which should perform on par with agrep, although it will not return the same results as agrep.
library(RecordLinkage)
ld <- levenshteinDist("test", c("tesr", "teqr", "toar"))
c("tesr", "teqr", "toar")[which(ld < 3)]