
How to measure similarity between strings?

I have a bunch of names, and I want to obtain the unique ones. However, because of spelling errors and inconsistencies in the data, the same name may be written in several different ways. I am looking for a way to check whether two strings in a character vector are similar.

For example:

pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.") 

I want to find that " Obama, B." and "Obama, B.H." are very similar. Is there a way to do this?

Sacha Epskamp asked May 18 '11 11:05


1 Answer

This can be done based on, e.g., the Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. Several packages implement it. Some solutions and packages can be found in the answers to these questions:

  • agrep: only return best match(es)
  • In R, how do I replace a string that contains a certain pattern with another string?
  • Fast Levenshtein distance in R?
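As a minimal sketch of the Levenshtein approach, base R's adist() (in the utils package) returns the full pairwise distance matrix. The cutoff of 3 edits below is an assumption chosen for this small example, not a recommended value; a sensible threshold depends on the data.

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.")

# adist() computes the generalized Levenshtein distance between all pairs.
d <- adist(pres)
dimnames(d) <- list(pres, pres)

# Keep each pair once (upper triangle) when it is within the cutoff.
# The cutoff is an assumed value for illustration only.
threshold <- 3
close <- which(d <= threshold & upper.tri(d), arr.ind = TRUE)

# One row per similar pair, with the edit distance between the two names.
similar <- data.frame(
  a        = pres[close[, "row"]],
  b        = pres[close[, "col"]],
  distance = d[close]
)
print(similar)
```

Note that the leading space in " Obama, B." counts as one edit, so trimming whitespace first (for example with trimws()) would bring the two Obama entries closer still.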

In most cases, though, agrep (approximate grep, available in base R) will do what you want:

> sapply(pres,agrep,pres)
$` Obama, B.`
[1] 1 3

$`Bush, G.W.`
[1] 2

$`Obama, B.H.`
[1] 1 3

$`Clinton, W.J.`
[1] 4
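To get from these match indices to a set of unique names, one possible sketch is to label every name with the index of its first match and keep one representative per label. The names group and unique_names are hypothetical, and the approach assumes the matches are mutual, as they are here; agrep matching is neither symmetric nor transitive in general, so larger data sets may need a proper clustering step.

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.")

# For each name, take the smallest index among its approximate matches.
# Every name matches itself, so agrep() never returns an empty vector here.
group <- vapply(pres, function(p) min(agrep(p, pres)),
                integer(1), USE.NAMES = FALSE)

# Keep the first spelling seen in each group and strip stray whitespace.
unique_names <- trimws(pres[unique(group)])
print(unique_names)
```

The default max.distance of 0.1 allows edits of up to 10% of the pattern length (rounded up); passing a different max.distance to agrep makes the matching stricter or looser.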
Joris Meys answered Oct 03 '22 10:10