Given this data (relative letter frequencies, in percent, for two languages):
spanish => 'e' => 13.72, 'a' => 11.72, 'o' => 8.44, 's' => 7.20, 'n' => 6.83,
english => 'e' => 12.60, 't' => 9.37, 'a' => 8.34, 'o' => 7.70, 'n' => 6.80,
Computing the letter frequencies for the string "this is a test" gives me:
"t"=>27.27, "s"=>27.27, "i"=>18.18, "h"=>9.09, "a"=>9.09, "e"=>9.09
What would be a good approach for matching a string's letter frequencies against these language profiles in order to detect its language? I've seen (and tested) some examples using Levenshtein distance, and it seems to work fine until you add more languages.
"this is a test" gives (shortest distance:) [:english, 13] ...
"esto es una prueba" gives (shortest distance:) [:spanish, 13] ...
Have you considered using cosine similarity to measure how similar two vectors are?
The first vector would be the letter frequencies extracted from the test string (to be classified), and the second vector would be for a specific language.
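As a rough illustration of that comparison, here is a minimal Ruby sketch. The profiles are the five-letter excerpts from the question; the method names (`letter_frequencies`, `cosine_similarity`, `rank_languages`) are made up for this example, and treating any letter missing from a profile as 0 is an assumption, not something the question specifies.

```ruby
# Letter-frequency profiles copied from the question (top five letters only).
PROFILES = {
  spanish: { 'e' => 13.72, 'a' => 11.72, 'o' => 8.44, 's' => 7.20, 'n' => 6.83 },
  english: { 'e' => 12.60, 't' => 9.37, 'a' => 8.34, 'o' => 7.70, 'n' => 6.80 }
}.freeze

# Relative frequency (in percent) of each letter in the text, ignoring non-letters.
def letter_frequencies(text)
  letters = text.downcase.scan(/[[:alpha:]]/)
  return {} if letters.empty?

  letters.tally.transform_values { |count| 100.0 * count / letters.size }
end

# Cosine similarity of two sparse vectors stored as hashes.
# Assumption: a key missing from either hash counts as 0.
def cosine_similarity(a, b)
  dot = a.sum { |key, value| value * b.fetch(key, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |v| v * v })
  norm_b = Math.sqrt(b.values.sum { |v| v * v })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Score every language against the text, best match first.
def rank_languages(text, profiles = PROFILES)
  freqs = letter_frequencies(text)
  profiles.map { |lang, profile| [lang, cosine_similarity(freqs, profile)] }
          .sort_by { |_, score| -score }
end

puts rank_languages("this is a test").inspect
puts rank_languages("esto es una prueba").inspect
```

With these truncated profiles, the first call ranks `:english` first and the second ranks `:spanish` first. Full 26-letter profiles should separate the languages more reliably than the five-letter excerpts used here.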
You're currently extracting single-letter frequencies (unigrams). I would suggest extracting higher-order n-grams, such as bigrams or trigrams (and even larger ones if you have enough training data). For example, for bigrams you would compute the frequencies of "aa", "ab", "ac" ... "zz", which lets you extract more information than single-character frequencies alone.
Be careful, though: higher-order n-grams need more training data; otherwise you will have many zero values for character combinations you haven't seen before.
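A small Ruby sketch of n-gram extraction, which also shows the zero-value problem. The helper names (`ngram_counts`, `ngram_frequencies`) are hypothetical, and counting n-grams only inside words (never across spaces) is an assumption made for this example.

```ruby
# Count overlapping character n-grams of length n inside each word of the text.
# Assumption: n-grams never span a word boundary.
def ngram_counts(text, n)
  text.downcase.scan(/[[:alpha:]]+/).each_with_object(Hash.new(0)) do |word, counts|
    word.chars.each_cons(n) { |gram| counts[gram.join] += 1 }
  end
end

# Convert raw counts into relative frequencies that sum to 1.
def ngram_frequencies(text, n)
  counts = ngram_counts(text, n)
  total = counts.values.sum.to_f
  counts.transform_values { |count| count / total }
end

# A "profile" trained on a tiny sample has seen very few bigrams...
profile = ngram_frequencies("this is a test", 2)
puts profile.inspect

# ...so most bigrams of a new string are missing from it (frequency 0).
unseen = ngram_counts("the quiz", 2).keys - profile.keys
puts unseen.inspect
```

Here the profile knows only six bigrams, so four of the five bigrams in "the quiz" are unseen. That is the sparsity issue in miniature: the larger the n, the more text is needed before the profile covers the n-grams you will meet in practice.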
A second possibility is to use tf-idf (term frequency-inverse document frequency) weightings instead of raw letter (term) frequencies.
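To make the tf-idf idea concrete, here is a hedged Ruby sketch that treats each language profile as a "document" and each letter as a "term". It uses the plain `log(N / df)` form of idf, which is only one of several common variants. The names `inverse_document_frequencies` and `tfidf` are invented for this example, and giving a term that appears in no profile a weight of 0 is an assumption.

```ruby
# Letter-frequency profiles copied from the question (top five letters only).
PROFILES = {
  spanish: { 'e' => 13.72, 'a' => 11.72, 'o' => 8.44, 's' => 7.20, 'n' => 6.83 },
  english: { 'e' => 12.60, 't' => 9.37, 'a' => 8.34, 'o' => 7.70, 'n' => 6.80 }
}.freeze

# idf(term) = log(number of profiles / number of profiles containing the term).
def inverse_document_frequencies(profiles)
  doc_counts = Hash.new(0)
  profiles.each_value do |profile|
    profile.each_key { |term| doc_counts[term] += 1 }
  end
  doc_counts.transform_values { |df| Math.log(profiles.size.to_f / df) }
end

# Re-weight a frequency vector by idf.
# Assumption: a term that appears in no profile gets weight 0.
def tfidf(frequencies, idf)
  frequencies.to_h { |term, tf| [term, tf * idf.fetch(term, 0.0)] }
end

idf = inverse_document_frequencies(PROFILES)
PROFILES.each do |lang, profile|
  puts "#{lang}: #{tfidf(profile, idf).inspect}"
end
```

With only two profiles, every letter they share ('e', 'a', 'o', 'n') gets an idf of 0, so only 's' and 't' keep a non-zero weight. That shows the intent (letters common to every language carry little discriminating information), but it is too aggressive for so few languages. The weighting becomes more useful with many languages and with n-grams rather than single letters.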
Here is a good slideshow on language identification for (very) short texts, which uses machine learning classifiers (but also has some other good info).
Here is a short paper, "A Comparison of Language Identification Approaches on Short, Query-Style Texts", that you might also find useful.