
How to detect language of text?

I have a form that lets users enter text snippets. How can I figure out the language of the entered text?

Specifically these languages for now:

Arabic: هذه هي بعض النصوص العربية

Chinese: 这是一些阿拉伯文字

Japanese: これは、いくつかのアラビア語のテキストです

[Edit] The detection has to work on text that is retrieved via an API too (no browser involved).

Yeti asked May 02 '10 06:05



1 Answer

You can figure out whether the characters are from the Arabic, Chinese, or Japanese sections of the Unicode map.

If you look at the list on Wikipedia, you'll see that each of those languages has many sections of the map. But you're not doing translation, so you don't need to worry about every last glyph.

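For example, your Chinese text begins (in hex) 0x8FD9 0x662F 0x4E00, and those are all in the "CJK Unified Ideographs" section, which covers Chinese characters. Japanese kanji live in that same section, so treat text as Japanese when it also contains Hiragana or Katakana. Here are a few ranges to get you started: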

Arabic (0600–06FF)

Japanese

  • Hiragana (3040–309F)
  • Katakana (30A0–30FF)
  • Kanbun (3190–319F)

Chinese

  • CJK Unified Ideographs (4E00–9FFF)
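
As a rough illustration, the ranges above could be turned into a small classifier. This is only a sketch in Python (the question does not name a language), and the names `SCRIPT_RANGES`, `count_scripts`, and `guess_language` are made up for the example. The tie-breaking rules (Arabic wins when it has at least as many characters as the East Asian scripts, and any kana marks the text as Japanese) are assumptions of this sketch, not part of the answer.

```python
# Inclusive code point ranges, copied from the list above.
SCRIPT_RANGES = {
    "arabic": [(0x0600, 0x06FF)],
    # Hiragana, Katakana, Kanbun
    "kana": [(0x3040, 0x309F), (0x30A0, 0x30FF), (0x3190, 0x319F)],
    "cjk": [(0x4E00, 0x9FFF)],
}


def count_scripts(text):
    """Count how many characters of text fall into each script range."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[name] += 1
                break
    return counts


def guess_language(text):
    """Guess Arabic, Chinese, or Japanese from character ranges alone."""
    counts = count_scripts(text)
    east_asian = counts["kana"] + counts["cjk"]
    if counts["arabic"] == 0 and east_asian == 0:
        return "unknown"
    if counts["arabic"] >= east_asian:
        return "Arabic"
    # Assumption: ideographs mixed with kana mean Japanese,
    # ideographs on their own mean Chinese.
    return "Japanese" if counts["kana"] > 0 else "Chinese"


if __name__ == "__main__":
    samples = [
        "هذه هي بعض النصوص العربية",
        "这是一些阿拉伯文字",
        "これは、いくつかのアラビア語のテキストです",
    ]
    for sample in samples:
        print(guess_language(sample), "<-", sample)
```

Because it only looks at code points, this works the same on text received from an API as on form input. It cannot tell apart languages that share a script, and a Japanese snippet written entirely in kanji would be reported as Chinese.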

(I got the hex for your Chinese by using a Chinese to Unicode Converter.)
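
If you would rather not rely on an online converter, the same hex values can probably be produced in a couple of lines of code. A minimal Python sketch, where `code_points` is a hypothetical helper name:

```python
def code_points(text):
    """Return each character's Unicode code point as an uppercase hex string."""
    return [f"0x{ord(ch):04X}" for ch in text]


# First three characters of the Chinese sample from the question.
print(code_points("这是一"))  # ['0x8FD9', '0x662F', '0x4E00']
```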

egrunin answered Sep 20 '22 17:09