I discovered the excellent package "stringdist" and now want to use it to compute string distances. In particular, I have a set of words and I want to print out the near matches, where a "near match" is determined by some metric such as the Levenshtein distance.
I already have working but extremely slow code in a shell script. In R, I was able to load stringdist and produce a matrix of distances. Now I want to boil that matrix down to a smaller one that contains only the near matches, that is, where the metric is non-zero but less than some threshold.
library(stringdist)
kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')
kpm <- stringdistmatrix(kp,useNames="strings",method="lv")
> kpm
                     leaflet leafletr lego levenshtein-distance
leafletr                   1
lego                       5        6
levenshtein-distance      16       16   18
logo                       6        7    1                   19
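# Expand the "dist" object into a full symmetric matrix
m = as.matrix(kpm)
# Flag each cell whose distance is non-zero and below the threshold of 5
close = apply(m, 1, function(x) x>0 & x<5)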
> close
                     leaflet leafletr  lego levenshtein-distance  logo
leaflet                FALSE     TRUE FALSE                FALSE FALSE
leafletr                TRUE    FALSE FALSE                FALSE FALSE
lego                   FALSE    FALSE FALSE                FALSE  TRUE
levenshtein-distance   FALSE    FALSE FALSE                FALSE FALSE
logo                   FALSE    FALSE  TRUE                FALSE FALSE
OK, now I have a (big) dist. How do I reduce it back to a list, where the output would be something like
leafletr,leaflet,1
logo,lego,1
for only those cases where the metric is non-zero and less than n=5? I found "apply()", which lets me do the test; now I need to sort out how to use its result.
The problem is not specific to stringdist and stringdistmatrix and is very elementary R, but still I'm stuck. I suspect the answer involves subset(), but I don't know how to transform a "dist" into something else.
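One possible approach, sketched here under some assumptions rather than offered as the definitive answer: convert the "dist" to a full matrix, keep only the lower triangle so that each pair is reported once, and use which() with arr.ind = TRUE to recover the row and column of every cell that passes the test. The helper name near_matches and its column names are hypothetical. So that the sketch runs without extra packages, base R's adist() (which also computes Levenshtein distances) stands in for stringdistmatrix(); the kpm object from above could be passed to the helper instead.

```r
# Hypothetical helper: list the pairs in a "dist" object whose distance
# is non-zero and below `threshold`.
near_matches <- function(d, threshold) {
  m <- as.matrix(d)
  # Restrict to the lower triangle so each pair appears only once.
  keep <- lower.tri(m) & m > 0 & m < threshold
  # Row/column positions of the cells that pass the test.
  idx <- which(keep, arr.ind = TRUE)
  data.frame(
    word1 = rownames(m)[idx[, "row"]],
    word2 = colnames(m)[idx[, "col"]],
    distance = m[idx],
    stringsAsFactors = FALSE
  )
}

kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')

# Assumption: adist() is used here as a stand-in for
# stringdistmatrix(kp, useNames="strings", method="lv").
m0 <- adist(kp)
dimnames(m0) <- list(kp, kp)
d <- as.dist(m0)

res <- near_matches(d, 5)

# Print as comma-separated lines without header or quotes.
write.table(res, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

With the five example words, this should report just the two pairs leafletr/leaflet and logo/lego, each at distance 1. Working from the lower triangle rather than from the full logical matrix avoids listing every pair twice.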