stringdist on one vector

Tags:

r

stringdist

I'm trying to use stringdist to find, for each string in a vector, any other string in the same vector within a maximum distance of 1, and then record the match next to it. Here is a sample of the data:

Starting data frame:

a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell") 
b = c(NA) 
df = data.frame(a,b) 

Desired results:

a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell") 
b = c("tomm", "tom", "alexi", "alex", 0, "jenn", "jen", 0) 
df = data.frame(a,b) 

I can use stringdist for two vectors, but am having trouble using it for one vector. Thanks for your help, R community.

richiepop2 asked Nov 23 '25 09:11


2 Answers

Here's one possible approach:

library(stringdist)

a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell") 

# For each string, return the closest other string in x,
# or "0" when no other string lies within tol.
min_dist <- function(x, method = "cosine", tol = .5){
    y <- vector(mode = "character", length = length(x))
    for(i in seq_along(x)){
        # distances from x[i] to every other element of x
        dis <- stringdist(x[i], x[-i], method = method)
        if (min(dis) > tol) {
            y[i] <- "0"
        } else {
            y[i] <- x[-i][which.min(dis)]
        }
    }
    y
}

min_dist(a, 'cosine', .4)

## [1] "tomm"  "tom"   "alexi" "alex"  "0"     "jenn"  "jen"   "0"
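
The question asks for a maximum distance of 1, which reads most naturally as an edit distance rather than a cosine distance. As a rough sketch of the same loop using Levenshtein distance (`method = "lv"`): `nearest_within` and `max_dist` are hypothetical names chosen for this illustration, not part of stringdist, and the sketch assumes the vector has at least two elements and uses `NA` instead of `"0"` for "no match".

```r
library(stringdist)

a <- c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell")

# nearest_within() is a hypothetical helper, not a stringdist function.
# For each element, it returns the closest other element whose
# Levenshtein distance is at most max_dist, or NA if there is none.
nearest_within <- function(x, max_dist = 1, method = "lv") {
    vapply(seq_along(x), function(i) {
        others <- x[-i]
        d <- stringdist(x[i], others, method = method)
        if (min(d) <= max_dist) others[which.min(d)] else NA_character_
    }, character(1))
}

res <- nearest_within(a)
res
```

With an integer edit distance the cutoff maps directly onto "at most one insertion, deletion, or substitution", so there is no tolerance to tune by hand.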
Tyler Rinker answered Nov 24 '25 21:11


You can use stringdistmatrix and which.min:

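library(stringdist)

df = data.frame(a,b, stringsAsFactors = FALSE)
mat <- stringdistmatrix(df$a, df$a) # pairwise distances (default method: "osa")
mat[mat==0] <- NA # ignore self (this also hides exact duplicates)
mat[mat>4] <- NA  # cut level; use mat>1 for the question's max distance of 1
amatch <- rowSums(mat, na.rm = TRUE)>0 # ignore no match
# drop = FALSE keeps a matrix even when only one row has a match
df$b[amatch] <- df$a[apply(mat[amatch, , drop = FALSE],1,which.min)]
df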
        a     b
1     tom  tomm
2    tomm   tom
3    alex alexi
4   alexi  alex
5   chris  <NA>
6     jen  jenn
7    jenn   jen
8 michell  <NA>
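
A possible variant of the matrix approach, sketched under two assumptions: the cutoff is the question's maximum distance of 1, and self-matches are masked through the diagonal rather than by testing for zero, so that exact duplicates elsewhere in the vector could still match each other. The variable names `has_match` and `b` are illustrative only.

```r
library(stringdist)

a <- c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn", "michell")

# Pairwise OSA distances between all elements of a
mat <- stringdistmatrix(a, a, method = "osa")

diag(mat) <- NA      # mask each string's distance to itself
mat[mat > 1] <- NA   # keep only candidates within distance 1

# Rows that still have at least one candidate
has_match <- rowSums(!is.na(mat)) > 0

b <- rep(NA_character_, length(a))
# which.min skips NA entries, so it picks the closest remaining candidate
b[has_match] <- a[apply(mat[has_match, , drop = FALSE], 1, which.min)]

data.frame(a, b, stringsAsFactors = FALSE)
```

Counting non-missing entries with `!is.na(mat)` avoids relying on the row sum being positive, which matters once zero distances between duplicates are allowed to stay in the matrix.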
HubertL answered Nov 24 '25 23:11


