How can I get a sound-similarity "rating" between a string written in one language and a string written in another? That is, an algorithm that would identify that
"David Letterman" and "דוד לטרמן" are strings that sound alike.
(By the way, the above is Hebrew for, you guessed it, "David Letterman", and it is pronounced almost the same as in English.)
The only raw material I have is Unicode strings in their respective languages. That is, I do not have phonemes or phonetic transcriptions/translations of the strings.
I have already implemented a tweaked version of Soundex, which works so-so. Is this the way to go?
Soundex may not be perfect, but it seems like a reasonable approach, at least for your specific example of English/Hebrew matching.
You definitely can't use the rule about preserving the first letter of the name, but I never liked that even for the Latin alphabet (because I'd have to look under both "E" and "Y" for my mother's family name). I recommend just treating the first letter like all the others.
Then it's just a matter of mapping the Hebrew letters to Soundex codes. You don't really need an intermediate English transliteration; just code the Hebrew → Soundex mapping directly.
However, because Soundex is English-centric, it may not correctly handle certain ambiguities in the Hebrew pronunciation:
To deal with this, you could generate multiple Soundex keys for a string. E.g., "שבת" would map to both 212 and 213.
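As a rough sketch of the direct Hebrew → Soundex mapping with multiple keys, something like the following Python could work. The table `HEBREW_SOUNDEX` and the function `hebrew_soundex_keys` are hypothetical names, and the letter assignments are assumptions chosen for illustration: ת is assumed to be either "t" (3) or "s" (2), and ו is assumed to be either "v" (1) or a dropped vowel. Adjacent duplicate codes are deliberately not collapsed, on the assumption that unwritten vowels usually separate Hebrew consonants.

```python
from itertools import product

# Illustrative (assumed) Hebrew -> Soundex table. Each letter maps to a tuple
# of alternative codes; "" means the letter contributes nothing to the key.
HEBREW_SOUNDEX = {
    "א": ("",), "ב": ("1",), "ג": ("2",), "ד": ("3",), "ה": ("",),
    "ו": ("1", ""),           # assumed ambiguity: "v" or a vowel
    "ז": ("2",), "ח": ("2",), "ט": ("3",), "י": ("",),
    "כ": ("2",), "ך": ("2",), "ל": ("4",), "מ": ("5",), "ם": ("5",),
    "נ": ("5",), "ן": ("5",), "ס": ("2",), "ע": ("",),
    "פ": ("1",), "ף": ("1",), "צ": ("2",), "ץ": ("2",), "ק": ("2",),
    "ר": ("6",), "ש": ("2",),
    "ת": ("3", "2"),          # assumed ambiguity: "t" or "s"
}


def hebrew_soundex_keys(text):
    """Return the set of all candidate Soundex-style keys for a Hebrew string.

    Characters missing from the table (spaces, punctuation, vowel points)
    are skipped. The first letter is treated like every other letter.
    """
    options = [HEBREW_SOUNDEX[ch] for ch in text if ch in HEBREW_SOUNDEX]
    # Cartesian product over the per-letter alternatives.
    return {"".join(combo) for combo in product(*options)}


print(hebrew_soundex_keys("שבת"))  # the two keys "212" and "213" (set order may vary)
```

The number of keys doubles with every ambiguous letter, so for long strings it may be worth capping the number of alternatives.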
Similar mappings can be made for Greek:
or Russian:
(Note that some of the 2's might be 32's, depending on your transliteration convention.)
A similarity "rating" can be obtained based on a metric like longest common subsequence length or Levenshtein distance on the Soundex values.
For example, you can define the "similarity" between two strings as 2*lcslen(A, B)/(len(A)+len(B)) to obtain a score between 0 and 1.
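A minimal Python sketch of that score, assuming the inputs are Soundex-style key strings. The function names `lcs_len` and `similarity` are illustrative, and returning 1.0 for two empty keys is a choice made here, not something the formula dictates.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """2 * lcslen(A, B) / (len(A) + len(B)), a score between 0 and 1."""
    if not a and not b:
        return 1.0  # assumed convention: two empty keys count as identical
    return 2 * lcs_len(a, b) / (len(a) + len(b))


print(similarity("213", "213"))  # 1.0
print(similarity("123", ""))     # 0.0
```

When a string yields several keys, one option is to score every pair of keys and keep the maximum.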
I'd suggest looking into the Daitch-Mokotoff Soundex code (particularly good with Hebrew). Check this, which takes English characters as input, and this, which takes Hebrew characters as input.