R: producing a list of near matches with stringdist and stringdistmatrix

I discovered the excellent package "stringdist" and now want to use it to compute string distances. In particular, I have a set of words and I want to print out near matches, where a "near match" is defined by a metric such as the Levenshtein distance.

I have extremely slow working code in a shell script, and I was able to load stringdist and produce a matrix of distances. Now I want to boil that matrix down to a smaller one that contains only the near matches, i.e. where the metric is non-zero but less than some threshold.

library(stringdist)
kp <- c('leaflet','leafletr','lego','levenshtein-distance','logo')
kpm <- stringdistmatrix(kp,useNames="strings",method="lv")
> kpm
                     leaflet leafletr lego levenshtein-distance
leafletr                   1                                   
lego                       5        6                          
levenshtein-distance      16       16   18                     
logo                       6        7    1                   19
m = as.matrix(kpm)
close = apply(m, 1, function(x) x>0 & x<5)
>  close
                     leaflet leafletr  lego levenshtein-distance  logo
 leaflet                FALSE     TRUE FALSE                FALSE FALSE
 leafletr                TRUE    FALSE FALSE                FALSE FALSE
 lego                   FALSE    FALSE FALSE                FALSE  TRUE
 levenshtein-distance   FALSE    FALSE FALSE                FALSE FALSE
 logo                   FALSE    FALSE  TRUE                FALSE FALSE

OK, now I have a (big) dist object. How do I reduce it to a list where the output would be something like

leafletr,leaflet,1
logo,lego,1

only for cases where the metric is non-zero and less than n=5? I found "apply()", which lets me do the test; now I need to work out how to use the result.

The problem is not specific to stringdist and stringdistmatrix and is very elementary R, but still I'm stuck. I suspect the answer involves subset(), but I don't know how to transform a "dist" into something else.
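
For illustration, here is a minimal sketch of one way the reduction could be done, assuming the full distance matrix fits in memory. It is not necessarily the best approach for a very large word list. To keep the sketch runnable without extra packages, it builds the matrix with base R's adist(), which also computes Levenshtein distances; the matrix m = as.matrix(kpm) from the session above should be a drop-in replacement. The names n, hits and near are made up for this example.

```r
kp <- c('leaflet', 'leafletr', 'lego', 'levenshtein-distance', 'logo')

# Base-R stand-in (an assumption of this sketch) for
#   m <- as.matrix(stringdistmatrix(kp, useNames = "strings", method = "lv"))
m <- adist(kp)
dimnames(m) <- list(kp, kp)

n <- 5  # hypothetical threshold: keep distances with 0 < d < n

# Comparisons are vectorised over the whole matrix, so apply() is not needed.
# lower.tri() restricts the test to one triangle so each pair appears once;
# which(..., arr.ind = TRUE) returns the row/column index of every TRUE cell.
hits <- which(lower.tri(m) & m > 0 & m < n, arr.ind = TRUE)

# Translate the indices back into words and look up the distances.
near <- data.frame(
  word1 = rownames(m)[hits[, "row"]],
  word2 = colnames(m)[hits[, "col"]],
  dist  = m[hits],
  stringsAsFactors = FALSE
)

# Print as comma-separated lines: word1,word2,distance
write.table(near, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

With the five example words, this should print the two lines asked for above (leafletr,leaflet,1 and logo,lego,1). The key step is converting the "dist" to an ordinary matrix first, after which which() with arr.ind = TRUE does the job that subset() was suspected to do.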