
Compare short strings in different languages for similar sound: is Soundex the answer?

How can I get a sound-similarity "rating" between a string written in one language and a string written in another? That is, I want an algorithm that will identify that

"David Letterman" and "דוד לטרמן" are strings that sound alike.

(The second string is Hebrew for, you guessed it, "David Letterman", and it is pronounced almost the same as in English.)

The only raw material I have is Unicode strings in their respective languages. That is, I do not have phonemes or phonetic transcriptions/translations of the strings.

I have already implemented a tweaked version of Soundex, which works so-so. Is this the way to go?

RabinDev asked May 26 '11 15:05


2 Answers

Soundex may not be perfect, but it seems like a reasonable approach, at least for your specific example of English/Hebrew matching.

You definitely can't use the rule about preserving the first letter of the name, but I never liked that even for the Latin alphabet (because I'd have to look under both "E" and "Y" for my mother's family name). I recommend just treating the first letter like all the others.

Then it's just a matter of mapping the Hebrew letters to Soundex codes. You don't really need an intermediate English transliteration; just code the Hebrew → Soundex mapping directly.

  • בוףפ → 1
  • גזחךכסקש → 2
  • דטת → 3
  • ץצ → 32
  • ל → 4
  • םמןנ → 5
  • ר → 6
  • אהיע → ignored
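
The list above can be written down as a lookup table. This is a minimal sketch, not a full Soundex implementation: the names `HEBREW_GROUPS`, `HEBREW_SOUNDEX` and `hebrew_key` are made up for illustration, and I'm assuming you don't collapse repeated codes or pad/truncate to a fixed length (Hebrew spelling already drops most vowels, so collapsing adjacent codes would merge consonants that English keeps apart).

```python
# Hebrew letter groups and their Soundex-style codes, copied from the list above.
HEBREW_GROUPS = [
    ("בוףפ", "1"),
    ("גזחךכסקש", "2"),
    ("דטת", "3"),
    ("ץצ", "32"),
    ("ל", "4"),
    ("םמןנ", "5"),
    ("ר", "6"),
    ("אהיע", ""),  # ignored letters contribute nothing
]

# Expand the groups into a per-character lookup table.
HEBREW_SOUNDEX = {ch: code for letters, code in HEBREW_GROUPS for ch in letters}


def hebrew_key(text):
    """Map each Hebrew letter to its code; anything else (spaces, niqqud) is skipped."""
    return "".join(HEBREW_SOUNDEX.get(ch, "") for ch in text)


print(hebrew_key("דוד לטרמן"))  # 31343655
```

For comparison, coding "David Letterman" with the usual English Soundex digits, the first letter treated like the rest and the doubled "tt" counted once, gives the same digits: 3-1-3, 4-3-6-5-5.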

However, because Soundex is English-centric, it may not correctly handle certain ambiguities in the Hebrew pronunciation:

  • ו is mapped to 1 (like English V) in the list above, but it often represents O, U, or W, in which case it should be ignored in Soundex.
  • ח is hard to classify due to its lack of an English equivalent. I put it in category 2 because this (1) matches the "ch" transliteration, and (2) allows ך/כ to have the same category with or without a dagesh.
  • Ashkenazi pronunciation would split ת between categories 2 and 3 (without a dagesh it is pronounced like "s").

To deal with this, you could generate multiple Soundex keys for a string. E.g., "שבת" would map to both 212 and 213.
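
Here is one way the multiple-key idea could look, as a sketch under my own assumptions: only ו and ת are treated as ambiguous, and the names `BASE`, `AMBIGUOUS` and `hebrew_keys` are hypothetical. Each ambiguous letter contributes every code it might have, and the keys are all combinations.

```python
from itertools import product

# Unambiguous letters: one code each (same grouping as the list above).
BASE = {
    ch: code
    for letters, code in [
        ("בףפ", "1"), ("גזחךכסקש", "2"), ("דט", "3"), ("ץצ", "32"),
        ("ל", "4"), ("םמןנ", "5"), ("ר", "6"), ("אהיע", ""),
    ]
    for ch in letters
}

# Ambiguous letters: every code the letter may take (an assumption of this sketch).
AMBIGUOUS = {
    "ו": ("1", ""),   # V, or a vowel/W that Soundex ignores
    "ת": ("3", "2"),  # T, or S in Ashkenazi pronunciation
}


def hebrew_keys(text):
    """Return the set of all keys obtained by trying every code of each ambiguous letter."""
    options = [AMBIGUOUS.get(ch, (BASE.get(ch, ""),)) for ch in text]
    return {"".join(combo) for combo in product(*options)}


print(sorted(hebrew_keys("שבת")))  # ['212', '213']
```

The number of keys doubles with every ambiguous letter, so this is only practical for short strings such as names. Two strings can then be considered a match if any pair of their keys is similar enough.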

Similar mappings can be made for Greek:

  • ΒΠΦ → 1
  • Ψ → 12
  • ΓΖΚΞΣΧ → 2
  • ΔΘΤ → 3
  • Λ → 4
  • ΜΝ → 5
  • Ρ → 6
  • ΑΕΗΙΟΥΩ → ignored

or Russian:

  • БВПФ → 1
  • ГЖЗКСХЧШЩ → 2
  • ДТ → 3
  • Ц → 32
  • Л → 4
  • МН → 5
  • Р → 6
  • АЕЁИЙОУЪЫЬЭЮЯ → ignored

(Note that some of the 2's might be 32's, depending on your transliteration convention.)
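
The Greek and Russian lists fit the same table-driven shape. A rough sketch, with `build_table` and `soundex_key` as illustrative names of my own; it uppercases the input because the lists above only give capital letters, and it silently skips anything not in the table (including accented Greek vowels, which would be ignored anyway).

```python
def build_table(groups):
    """Expand {"LETTERS": "code"} groups into a per-character lookup."""
    return {ch: code for letters, code in groups.items() for ch in letters}


GREEK = build_table({
    "ΒΠΦ": "1", "Ψ": "12", "ΓΖΚΞΣΧ": "2", "ΔΘΤ": "3",
    "Λ": "4", "ΜΝ": "5", "Ρ": "6", "ΑΕΗΙΟΥΩ": "",
})

RUSSIAN = build_table({
    "БВПФ": "1", "ГЖЗКСХЧШЩ": "2", "ДТ": "3", "Ц": "32",
    "Л": "4", "МН": "5", "Р": "6", "АЕЁИЙОУЪЫЬЭЮЯ": "",
})


def soundex_key(text, table):
    """Code every letter found in the table; skip everything else."""
    return "".join(table.get(ch, "") for ch in text.upper())


print(soundex_key("Давид Леттерман", RUSSIAN))  # 313433655
print(soundex_key("Δαβίδ", GREEK))              # 313
```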


A similarity "rating" can be obtained based on a metric like longest common subsequence length or Levenshtein distance on the Soundex values.

For example, you can define the "similarity" between two strings as 2*lcslen(A, B)/(len(A)+len(B)) to obtain a score between 0 and 1.
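
A small sketch of that formula, using a standard dynamic-programming LCS. The function names are mine, and returning 1.0 for two empty keys is an assumption (the formula itself is undefined there).

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """2 * lcslen(A, B) / (len(A) + len(B)), a score between 0 and 1."""
    if not a and not b:
        return 1.0  # assumption: two empty keys count as identical
    return 2 * lcs_len(a, b) / (len(a) + len(b))


print(similarity("31343655", "31343655"))             # 1.0
print(round(similarity("313433655", "31343655"), 3))  # 0.941
```

The second call compares a key that keeps both t's of "Letterman" with one that keeps a single t. The LCS has length 8, so the score is 16/17.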

dan04 answered Sep 27 '22 22:09


I'd suggest looking into the Daitch-Mokotoff Soundex code (particularly good with Hebrew). Check this, which takes English characters as input, and this, which takes Hebrew characters as input.

Amnon answered Sep 27 '22 21:09