
Recognizing similarity in strings

I'm working on a system which allows imported files to be localized into other languages.

This is mostly a private project to get the hang of MVC3, Entity Framework, LINQ, and so on. To spice up the end result I like to add some crazy features, and one of those would be the recognition of similar strings.

Imagine you have the following list of strings - borrowed from a game I've worked with in the past:

  • Megabeth: Holy Roller Uniform - Includes Head, Torso, and Legs
  • Megabeth: Holy Roller Uniform Head
  • Megabeth: Holy Roller Uniform Legs
  • Megabeth: Holy Roller Uniform Torso
  • Megabeth: PAX East 2012 Uniform - Includes Head, Torso, and Legs
  • Megabeth: PAX East 2012 Uniform Head
  • Megabeth: PAX East 2012 Uniform Legs
  • Megabeth: PAX East 2012 Uniform Torso

As you can see, once users have translated the first 4 strings, the following 4 share a lot of fragments with them, in this case:

  • Megabeth
  • Uniform
  • Includes Head, Torso, and Legs
  • Head
  • Legs
  • Torso

Assume the first 4 strings have already been translated. When a user selects the 5th string from the list, what kind of algorithm or technique can I use to show the user the 1st string (and potentially others) under a sub-header of "Similar strings"?

Edit - A little comment on the Levenshtein Distance: I'm currently targeting 10k strings in the database. Levenshtein Distance compares one pair of strings at a time, so in this case there are 10k x (10k - 1) possible combinations. How would I approach this in a feasible way? Is there a better solution than this particular algorithm?

Lennard Fonteijn asked Oct 22 '12 20:10



1 Answer

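You could look into the Levenshtein Distance, which counts the single-character insertions, deletions, and substitutions needed to turn one string into another. Pairs of strings whose distance falls below a certain threshold can be considered similar. Two identical strings have a distance of zero.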

There's a C# implementation, amongst other languages, on Rosetta Code.
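As a rough illustration of the threshold idea, here is a minimal sketch in Python (not the C# from Rosetta Code, so treat it as illustrative only). The function names `levenshtein` and `similar_strings`, and the `max_ratio` cutoff of 0.5, are assumptions made for this example. The cutoff is expressed relative to the longer string's length so that long and short strings are judged on the same scale.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay small
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similar_strings(selected, translated, max_ratio=0.5):
    """Return (distance, string) pairs close to `selected`, nearest first.

    `max_ratio` is a hypothetical cutoff: distance divided by the length
    of the longer string must not exceed it.
    """
    matches = []
    for candidate in translated:
        distance = levenshtein(selected, candidate)
        longest = max(len(selected), len(candidate)) or 1
        if distance / longest <= max_ratio:
            matches.append((distance, candidate))
    return sorted(matches)


translated = [
    "Megabeth: Holy Roller Uniform - Includes Head, Torso, and Legs",
    "Megabeth: Holy Roller Uniform Head",
    "Megabeth: Holy Roller Uniform Legs",
    "Megabeth: Holy Roller Uniform Torso",
]
for distance, text in similar_strings("Megabeth: PAX East 2012 Uniform Head", translated):
    print(distance, text)
```

Note that showing "Similar strings" for one selected string only needs that string compared against the already translated ones, which is a single pass over the list per lookup rather than every pair against every other pair.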

keyboardP answered Oct 11 '22 22:10