I'm working on a system which allows imported files to be localized into other languages.
This is mostly a private project to get the hang of MVC3, Entity Framework, LINQ, etc. To spice up the end result I like to try some crazy things, and one of them is recognizing similar strings.
Imagine you have the following list of strings - borrowed from a game I've worked with in the past:
As you can see, once users have translated the first 4 strings, the following 4 share a lot of similarities, in this case:
Assuming the first 4 strings are indeed already translated: when a user selects the 5th string from the list, what kind of algorithm or technique can I use to show the user the 1st string (and potentially others) under a sub-header of "Similar strings"?
Edit - A little comment on the Levenshtein Distance: I'm currently targeting 10k strings in the database. Levenshtein Distance compares one pair of strings at a time, so in this case that means 10k x (10k - 1) possible combinations. How would I approach this in a feasible way? Is there a better solution than this particular algorithm?
To calculate the similarity between two examples, you need to combine all the feature data for those two examples into a single numeric value. For instance, consider a shoe data set with only one feature: shoe size. You can quantify how similar two shoes are by calculating the difference between their sizes.
In Python, use ratio() to measure similarity between two strings. Pass two strings into difflib.SequenceMatcher(isjunk, a, b) with isjunk set to None to get a SequenceMatcher object that compares them. Call ratio() on this object to get a score between 0 and 1: twice the number of matching characters divided by the total number of characters in both strings.
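As a minimal sketch of the difflib approach described above, the snippet below scores one selected string against a list of candidates and returns them best match first. The helper names similarity and rank_by_ratio, the 0.6 threshold, and the sample strings are assumptions made for illustration; they are not from the original question.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the strings are identical."""
    # isjunk=None: no characters are treated as junk.
    return SequenceMatcher(None, a, b).ratio()


def rank_by_ratio(query, candidates, threshold=0.6):
    """Return (score, candidate) pairs at or above threshold, best first.

    The threshold is an illustrative assumption; tune it on real data.
    """
    scored = [(similarity(query, c), c) for c in candidates if c != query]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


if __name__ == "__main__":
    # Hypothetical game strings, not the list from the question.
    strings = ["Fire Sword", "Ice Sword", "Healing Potion"]
    for score, text in rank_by_ratio("Fire Sword +1", strings):
        print(f"{score:.2f}  {text}")
```

Each call compares only the selected string against the others, so the cost per lookup grows with the number of stored strings rather than with the number of pairs.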
The simplest way to compute the similarity between two documents using word embeddings is to compute the document centroid vector. This is the vector that's the average of all the word vectors in the document.
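To make the centroid idea concrete, here is a small sketch that averages word vectors and compares two centroids with cosine similarity. The three-dimensional TOY_VECTORS table is made up purely for illustration; a real system would load vectors from a trained embedding model, and the function names are likewise assumptions.

```python
import math

# Made-up toy vectors for illustration only; real embeddings have
# hundreds of dimensions and come from a trained model.
TOY_VECTORS = {
    "fire": [0.9, 0.1, 0.0],
    "flame": [0.8, 0.2, 0.0],
    "sword": [0.1, 0.9, 0.1],
    "blade": [0.2, 0.8, 0.1],
    "potion": [0.0, 0.1, 0.9],
}


def centroid(text):
    """Average the vectors of all known words; None if no word is known."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    a = centroid("fire sword")
    print(cosine(a, centroid("flame blade")))
    print(cosine(a, centroid("potion")))
```

Unlike the character-based measures, this compares meaning rather than spelling, so it only helps when suitable word vectors exist for the language of the strings.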
Typically, the Jaccard similarity coefficient (or index) is used to compare the similarity between two sets. For two sets, A and B, the Jaccard index is defined as the ratio of the size of their intersection to the size of their union: J(A,B) = |A ∩ B| / |A ∪ B|
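Applied to strings, one possible reading of the Jaccard index is to treat each string as a set of lowercase words. The sketch below assumes whitespace tokenization and defines two empty strings as identical; both choices, and the function name jaccard, are illustrative assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index of the word sets of two strings, in [0, 1]."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        # Assumption: two empty strings count as identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Shared words: {"the", "sword"}; all words: {"the", "red", "blue", "sword"}.
    print(jaccard("the red sword", "the blue sword"))  # 0.5
```

Because it ignores word order and repetition, this measure is cheap to compute and tends to suit strings that differ by one or two swapped words.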
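You could look into the Levenshtein Distance, which counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. Pairs whose distance falls below a certain threshold can be considered similar. Two strings that are identical will have a distance of zero.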
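A minimal sketch of the threshold idea follows, written in Python rather than C# for brevity. The function names, the max_distance default of 5, and the length-difference shortcut are assumptions for illustration, not part of the answer. Note that showing similar strings for one selected string needs only one pass over the stored strings, not every pair.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using the two-row dynamic programming formulation."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def find_similar(query, candidates, max_distance=5):
    """Return (distance, candidate) pairs within max_distance, closest first."""
    hits = []
    for candidate in candidates:
        if candidate == query:
            continue
        # The length difference is a lower bound on the distance,
        # so clearly different strings are skipped without running the DP.
        if abs(len(candidate) - len(query)) > max_distance:
            continue
        distance = levenshtein(query, candidate)
        if distance <= max_distance:
            hits.append((distance, candidate))
    return sorted(hits)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
```

A fixed threshold treats short and long strings differently, so dividing the distance by the longer string's length may be worth trying if the stored strings vary a lot in size.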
There's a C# implementation, amongst other languages, on Rosetta Code.