Is there a service or library (free or paid) that takes a piece of text and returns its language?
I need to go over a million blog posts and determine their languages.
In Microsoft Word: on the Review tab, in the Language group, click Language, then click Set Proofing Language. In the Language dialog box, select the Detect language automatically check box. Review the languages shown above the double line in the Mark selected text as list.
With the PEAR package Text_LanguageDetect (PHP), the list of supported languages can be retrieved by calling getLanguages on a Text_LanguageDetect object. It returns an array of strings that name the languages, e.g. array('albanian', 'arabic', 'azeri'). To actually detect the language of a piece of text, use the detect method on the same object.
How does language detection work? Language classification relies on a body of reference text called a 'corpus'. There is one corpus for each language the algorithm can identify. In short, the input text is compared to each corpus, and pattern matching is used to find the corpus it correlates with most strongly.
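To make the corpus comparison above concrete, here is a minimal Python sketch. It is only one possible way to do the "pattern matching": it counts character trigrams and scores each corpus by cosine similarity. The three toy corpora and the names `CORPORA`, `trigrams`, `cosine` and `detect_language` are made up for illustration; a real detector would build its profiles from far larger corpora and may use a different scoring method.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpora, one per language. Far too small for real use.
CORPORA = {
    "english": "the quick brown fox jumps over the lazy dog and then it runs "
               "away from the house this is what we think about that",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y luego "
               "se va de la casa esto es lo que pensamos sobre eso",
    "german": "der schnelle braune fuchs springt ueber den faulen hund und "
              "dann rennt er vom haus weg das ist was wir denken",
}


def trigrams(text):
    """Count overlapping 3-character sequences in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# One trigram profile per corpus, built once up front.
PROFILES = {lang: trigrams(corpus) for lang, corpus in CORPORA.items()}


def detect_language(text):
    """Return (best_language, scores) by comparing the text to every profile."""
    sample = trigrams(text)
    scores = {lang: cosine(sample, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    best, scores = detect_language("el perro se va de la casa")
    print(best, scores)
```

For a million blog posts the profiles are built only once, so each post costs one trigram count plus one comparison per language. Very short posts give unreliable scores with this kind of approach.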
I think this is the best out there!
https://code.google.com/p/language-detection/