How to detect the language of a string?

What's the best way to detect the language of a string?

Alon Gubkin asked Jul 28 '09 08:07



3 Answers

If your code runs somewhere with internet access, you can try the Google API for language detection: http://code.google.com/apis/ajaxlanguage/documentation/

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
  if (!result.error) {
    var language = 'unknown';
    // Map the detected language code back to its name.
    for (var l in google.language.Languages) {
      if (google.language.Languages[l] == result.language) {
        language = l;
        break;
      }
    }
    var container = document.getElementById("detection");
    container.innerHTML = text + " is: " + language + "";
  }
});

And, since you are using C#, take a look at this article on how to call the API from C#.

UPDATE: That C# link is gone, so here is a cached copy of the core of it:

string s = TextBoxTranslateEnglishToHebrew.Text;
string key = "YOUR GOOGLE AJAX API KEY";
GoogleLangaugeDetector detector =
   new GoogleLangaugeDetector(s, VERSION.ONE_POINT_ZERO, key);

GoogleTranslator gTranslator = new GoogleTranslator(s, VERSION.ONE_POINT_ZERO,
   detector.LanguageDetected.Equals("iw") ? LANGUAGE.HEBREW : LANGUAGE.ENGLISH,
   detector.LanguageDetected.Equals("iw") ? LANGUAGE.ENGLISH : LANGUAGE.HEBREW,
   key);

TextBoxTranslation.Text = gTranslator.Translation;

Basically, you need to build a URI like the following and send it to Google:

http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=hello%20world&langpair=en%7ciw&key=your_google_api_key_goes_here

This tells the API that you want to translate "hello world" from English to Hebrew, to which Google's JSON response would look like:

{"responseData": {"translatedText":"שלום העולם"}, "responseDetails": null, "responseStatus": 200} 
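
As a rough sketch of the two steps above (building the request URI and reading the translated text out of the JSON), something like the following could work. The helper names `buildTranslateUrl` and `parseTranslation` are made up for illustration and are not part of any Google library. No request is actually sent, so whether the endpoint quoted above is still reachable is not assumed here.

```javascript
// Base URL taken from the example request quoted above.
const BASE_URL = "http://ajax.googleapis.com/ajax/services/language/translate";

// Hypothetical helper: assembles the query string from its four parameters.
function buildTranslateUrl(text, from, to, key) {
  const params = [
    ["v", "1.0"],
    ["q", text],
    ["langpair", from + "|" + to],
    ["key", key],
  ];
  // encodeURIComponent escapes the space as %20 and the "|" as %7C.
  const query = params
    .map(([name, value]) => name + "=" + encodeURIComponent(value))
    .join("&");
  return BASE_URL + "?" + query;
}

// Hypothetical helper: returns the translated text, or null when the
// response does not report status 200 or carries no responseData.
function parseTranslation(json) {
  const response = JSON.parse(json);
  if (response.responseStatus !== 200 || !response.responseData) {
    return null;
  }
  return response.responseData.translatedText;
}

const url = buildTranslateUrl("hello world", "en", "iw", "your_google_api_key_goes_here");
const sample = '{"responseData": {"translatedText":"שלום העולם"}, "responseDetails": null, "responseStatus": 200}';

console.log(url);
console.log(parseTranslation(sample));
```

The generated URI matches the one shown above except that the escaped pipe comes out as `%7C` rather than `%7c`; percent-escapes are case-insensitive, so the two are equivalent.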

I chose to make a base class that represents a typical Google JSON response:

[Serializable]
public class JSONResponse
{
   public string responseDetails = null;
   public string responseStatus = null;
}

Then, a Translation object that inherits from this class:

[Serializable]
public class Translation: JSONResponse
{
   public TranslationResponseData responseData =
     new TranslationResponseData();
}

This Translation class has a TranslationResponseData object that looks like this:

[Serializable]
public class TranslationResponseData
{
   public string translatedText;
}

Finally, we can make the GoogleTranslator class:

using System;
using System.Collections.Generic;
using System.Text;

using System.Web;
using System.Net;
using System.IO;
using System.Runtime.Serialization.Json;

namespace GoogleTranslationAPI
{
   public class GoogleTranslator
   {
      private string _q = "";
      private string _v = "";
      private string _key = "";
      private string _langPair = "";
      private string _requestUrl = "";
      private string _translation = "";

      public GoogleTranslator(string queryTerm, VERSION version, LANGUAGE languageFrom,
         LANGUAGE languageTo, string key)
      {
         // URL-encode every parameter before building the request.
         _q = HttpUtility.UrlPathEncode(queryTerm);
         _v = HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(version));
         _langPair =
            HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(languageFrom) +
            "|" + EnumStringUtil.GetStringValue(languageTo));
         _key = HttpUtility.UrlEncode(key);

         string encodedRequestUrlFragment =
            string.Format("?v={0}&q={1}&langpair={2}&key={3}",
            _v, _q, _langPair, _key);

         _requestUrl = EnumStringUtil.GetStringValue(BASEURL.TRANSLATE) + encodedRequestUrlFragment;

         GetTranslation();
      }

      public string Translation
      {
         get { return _translation; }
         private set { _translation = value; }
      }

      private void GetTranslation()
      {
         try
         {
            WebRequest request = WebRequest.Create(_requestUrl);

            // Dispose the response and reader once the body has been read.
            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
               // Read the whole body, not just its first line.
               string json = reader.ReadToEnd();
               using (MemoryStream ms = new MemoryStream(Encoding.Unicode.GetBytes(json)))
               {
                  DataContractJsonSerializer ser =
                     new DataContractJsonSerializer(typeof(Translation));
                  Translation translation = ser.ReadObject(ms) as Translation;

                  _translation = translation.responseData.translatedText;
               }
            }
         }
         catch (Exception)
         {
            // Errors are swallowed: Translation stays empty if the request or parsing fails.
         }
      }
   }
}
Magnus Johansson answered Sep 19 '22 17:09


Fast answer: NTextCat (NuGet, Online Demo)

Long answer:

Currently, the best approach seems to be a classifier trained to assign a piece of text to one (or more) languages from a predefined set.

There is a Perl tool called TextCat. It has language models for the 74 most popular languages, and it has been ported to many programming languages.

There were no ports to .NET, so I wrote one: NTextCat on GitHub.

It is a pure .NET Framework DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.

Any feedback is much appreciated! New ideas and feature requests are welcome too :)

An alternative is to use one of the many online services (e.g. the Google one already mentioned, detectlanguage.com, langid.net, etc.).
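
To give a feel for what such a classifier does, here is a minimal sketch of a ranked character n-gram approach, which is how TextCat-style tools are usually described. This is an assumption about the general technique, not NTextCat's actual code or API. The function names, the fixed penalty, and the tiny training sentences are all invented for illustration; real language models are built from much larger corpora.

```javascript
// Assumed cap on profile length; also used as the penalty for unseen trigrams.
const MAX_PROFILE_SIZE = 300;

// Build a profile: the text's character trigrams, ranked most frequent first.
function ngramProfile(text, n = 3) {
  // Lowercase, collapse anything that is not a letter into one space, pad the ends.
  const normalized = " " + text.toLowerCase().replace(/[^\p{L}]+/gu, " ").trim() + " ";
  const counts = new Map();
  for (let i = 0; i + n <= normalized.length; i++) {
    const gram = normalized.slice(i, i + n);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return [...counts.entries()]
    // Sort by count, breaking ties alphabetically so the ranking is deterministic.
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : 1))
    .slice(0, MAX_PROFILE_SIZE)
    .map(([gram]) => gram);
}

// "Out-of-place" distance: sum of rank differences between the two profiles,
// with a fixed penalty for trigrams the language profile does not contain.
function outOfPlace(docProfile, langProfile) {
  const rank = new Map(langProfile.map((gram, i) => [gram, i]));
  let distance = 0;
  docProfile.forEach((gram, i) => {
    distance += rank.has(gram) ? Math.abs(rank.get(gram) - i) : MAX_PROFILE_SIZE;
  });
  return distance;
}

// Pick the language whose profile is closest to the text's profile.
function detectLanguage(text, profiles) {
  const doc = ngramProfile(text);
  let best = null;
  for (const [lang, profile] of Object.entries(profiles)) {
    const distance = outOfPlace(doc, profile);
    if (best === null || distance < best.distance) {
      best = { lang, distance };
    }
  }
  return best.lang;
}

// Toy "training" data, far too small for real use.
const profiles = {
  english: ngramProfile("the quick brown fox jumps over the lazy dog and then the dog runs after the fox through the fields"),
  spanish: ngramProfile("el rápido zorro marrón salta sobre el perro perezoso y luego el perro corre tras el zorro por los campos"),
};

console.log(detectLanguage("where is the dog", profiles));
console.log(detectLanguage("¿dónde está el perro?", profiles));
```

With profiles this small the result is only reliable for text that shares vocabulary with the training sentences; the point is the shape of the method, not its accuracy.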

Ivan Akcheurov answered Sep 19 '22 17:09


A statistical approach based on digraph or trigraph frequencies is a very good indicator of language. For example, here are the most common digraphs in English, in order: http://www.letterfrequency.org/#digraph-frequency (one can find better or more complete lists). For short snippets of text, this method may have a better success rate than word analysis because a text contains more digraphs than complete words.
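
A minimal sketch of the digraph idea, assuming you compare a snippet's digraph frequencies against a reference table per language using cosine similarity (one reasonable choice among several; the answer does not prescribe a specific measure). The function names and the short reference sentences are hypothetical; in practice the reference frequencies would come from a large corpus or a published table such as the one linked above.

```javascript
// Relative frequency of each adjacent letter pair (digraph) inside words.
function digraphFrequencies(text) {
  const counts = new Map();
  let total = 0;
  for (const word of text.toLowerCase().split(/[^\p{L}]+/u)) {
    for (let i = 0; i + 2 <= word.length; i++) {
      const pair = word.slice(i, i + 2);
      counts.set(pair, (counts.get(pair) || 0) + 1);
      total++;
    }
  }
  // Convert raw counts to proportions so texts of different length compare fairly.
  for (const [pair, count] of counts) {
    counts.set(pair, count / total);
  }
  return counts;
}

// Cosine similarity between two sparse frequency vectors (0 when either is empty).
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [pair, value] of a) {
    normA += value * value;
    if (b.has(pair)) dot += value * b.get(pair);
  }
  for (const value of b.values()) normB += value * value;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// Return the language whose reference table is most similar to the text.
function closestLanguage(text, references) {
  const sample = digraphFrequencies(text);
  let bestLang = null;
  let bestScore = -1;
  for (const [lang, reference] of Object.entries(references)) {
    const score = cosineSimilarity(sample, reference);
    if (score > bestScore) {
      bestScore = score;
      bestLang = lang;
    }
  }
  return bestLang;
}

// Toy reference tables built from single sentences, for illustration only.
const references = {
  english: digraphFrequencies("the quick brown fox jumps over the lazy dog and then the dog runs after the fox through the fields"),
  spanish: digraphFrequencies("el rápido zorro marrón salta sobre el perro perezoso y luego el perro corre tras el zorro por los campos"),
};

console.log(closestLanguage("the dog and the fox", references));
console.log(closestLanguage("el perro corre", references));
```

The point about short snippets shows up directly here: "where is the dog" has only 4 words but 9 digraphs, so there is more evidence to compare against each table.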

Greg Hewgill answered Sep 20 '22 17:09