
Word language detection in C++

Tags:

c++

After searching on Google, I still haven't found a standard method or library for detecting which language a particular word belongs to.

Given an arbitrary word, how can I find out which language it is in: English, Japanese, Italian, German, etc.?

Is there any library available for C++? Any suggestion in this regard will be greatly appreciated!

Vivek Kumar asked Apr 04 '11 11:04




2 Answers

Simple language recognition from words is easy. You don't need to understand the semantics of the text, and you don't need any computationally expensive algorithms, just a fast hash map. The problem is that you need a lot of data. Fortunately, you can probably find dictionaries of words in each language you care about. Define a bit mask with one bit per language; that lets you mark a word like "the" as recognized in multiple languages. Then read each language dictionary into your hash map. If the word is already present from a different language, just set the current language's bit as well.

Suppose a given word is in both English and French. Then looking it up, e.g. dict["commercial"], yields ENGLISH|FRENCH. With ENGLISH = 1, FRENCH = 2, ..., you'll find the value 3. If you want to know whether a word is in your language only, you would test:

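int langs = dict["the"];          // bit mask of every language that contains the word
// Parentheses are required: == binds tighter than |, so the unparenthesized
// form would evaluate langs | (mylang == mylang) and always be true.
if ((langs | mylang) == mylang)
   // no other language (note: also true when langs is 0, i.e. an unknown word)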
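
Here is a minimal sketch of the dictionary described above, assuming C++11 or later. The names Lang, LangDict, add_words, lookup and only_in are made up for illustration, and the three language flags are just placeholders for whatever languages you load. The word lists would come from the dictionaries you collect.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical language flags: one bit per language.
enum Lang : std::uint32_t {
    ENGLISH = 1u << 0,
    FRENCH  = 1u << 1,
    GERMAN  = 1u << 2,
};

// Maps a word to the bit mask of all languages it appears in.
using LangDict = std::unordered_map<std::string, std::uint32_t>;

// Mark every word in `words` as belonging to `lang`. If a word is already
// present from another language, the new bit is OR-ed into its mask.
void add_words(LangDict& dict, const std::vector<std::string>& words,
               std::uint32_t lang) {
    assert(lang != 0 && "a word list must belong to some language");
    for (const auto& w : words) {
        dict[w] |= lang;  // operator[] value-initializes new entries to 0
    }
}

// Return the language mask for `word`, or 0 if the word is unknown.
// Uses find() so that a lookup never inserts into the map.
std::uint32_t lookup(const LangDict& dict, const std::string& word) {
    auto it = dict.find(word);
    return it == dict.end() ? 0u : it->second;
}

// True if `word` is known and appears in `lang` and in no other language.
bool only_in(const LangDict& dict, const std::string& word,
             std::uint32_t lang) {
    return lookup(dict, word) == lang;
}
```

With "commercial" added under both ENGLISH and FRENCH, lookup(dict, "commercial") gives ENGLISH | FRENCH, and only_in(dict, "commercial", ENGLISH) is false. Comparing the mask for equality, rather than using the OR test, also rejects unknown words.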



Since there will be other languages, a more general approach is probably better. For each bit set in the word's bit mask, add 1 to the counter for the corresponding language. Do this for n words. After about n=10 words of a typical text, you'll have 10 for the dominant language and maybe 2 for a related language (like English/French), and you can determine with high probability that the text is English. Remember, even a text that is in one language can still contain a quote in another, so the mere presence of a foreign word doesn't mean the document is in that language. Pick a threshold and it will work quite well (and very, very fast).
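
The counting step could be sketched like this, again assuming C++11 or later. The name detect_language, the constant kNumLangs and the bit layout (bit 0 = English, bit 1 = French, bit 2 = German) are assumptions for illustration only. Splitting the text into words is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical layout: bit 0 = English, bit 1 = French, bit 2 = German.
constexpr std::size_t kNumLangs = 3;

using LangDict = std::unordered_map<std::string, std::uint32_t>;

// Count one vote per language for every bit set in each word's mask.
// Returns the bit index of the language with the most votes, or -1 if no
// language reaches `min_votes`. On a tie, the lowest bit index wins.
int detect_language(const LangDict& dict,
                    const std::vector<std::string>& words,
                    int min_votes) {
    assert(min_votes > 0 && "the threshold must be positive");
    std::array<int, kNumLangs> votes{};  // zero-initialized counters
    for (const auto& w : words) {
        auto it = dict.find(w);
        if (it == dict.end()) continue;  // unknown words cast no vote
        for (std::size_t i = 0; i < kNumLangs; ++i) {
            if (it->second & (1u << i)) ++votes[i];
        }
    }
    int best = -1;
    int best_votes = 0;
    for (std::size_t i = 0; i < kNumLangs; ++i) {
        if (votes[i] > best_votes) {
            best_votes = votes[i];
            best = static_cast<int>(i);
        }
    }
    return best_votes >= min_votes ? best : -1;
}
```

A word shared by two languages votes for both, so a related language collects some votes but the dominant one still comes out ahead. The min_votes parameter is the threshold mentioned above.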

Obviously the hardest thing about this is reading in all the dictionaries. This isn't a code problem, it's a data collection problem. Fortunately, that's your problem, not mine.

To make this fast, you will need to preload the hash map; otherwise the initial load is going to hurt. If that's an issue, you will have to write store and load methods for the hash map that write it out and read the entire thing back in as an efficient block.
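
One possible shape for such store and load functions, as a rough sketch assuming C++11 or later. The file format here (entry count, then length-prefixed words with their masks, in host byte order) is invented for illustration and is not portable across architectures. It still reads entry by entry through a buffered stream; a true single-read block load would need a flat in-memory layout instead of std::unordered_map.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <limits>
#include <string>
#include <unordered_map>
#include <utility>

using LangDict = std::unordered_map<std::string, std::uint32_t>;

// Write the map as: entry count, then (word length, word bytes, mask)
// for each entry. Returns false if the file could not be written.
bool store(const LangDict& dict, const std::string& path) {
    std::ofstream out(path, std::ios::binary);
    const std::uint64_t count = dict.size();
    out.write(reinterpret_cast<const char*>(&count), sizeof count);
    for (const auto& entry : dict) {
        assert(entry.first.size() <= std::numeric_limits<std::uint32_t>::max());
        const std::uint32_t len = static_cast<std::uint32_t>(entry.first.size());
        out.write(reinterpret_cast<const char*>(&len), sizeof len);
        out.write(entry.first.data(), len);
        out.write(reinterpret_cast<const char*>(&entry.second), sizeof entry.second);
    }
    out.flush();
    return static_cast<bool>(out);
}

// Read a file written by store(). Returns false if the file is missing or
// truncated. Assumes a trusted file: the stored count is used for reserve().
bool load(LangDict& dict, const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::uint64_t count = 0;
    if (!in.read(reinterpret_cast<char*>(&count), sizeof count)) return false;
    dict.clear();
    dict.reserve(count);  // size the bucket array once to avoid rehashing
    for (std::uint64_t i = 0; i < count; ++i) {
        std::uint32_t len = 0;
        std::uint32_t mask = 0;
        if (!in.read(reinterpret_cast<char*>(&len), sizeof len)) return false;
        std::string word(len, '\0');
        if (!in.read(&word[0], len)) return false;
        if (!in.read(reinterpret_cast<char*>(&mask), sizeof mask)) return false;
        dict.emplace(std::move(word), mask);
    }
    return true;
}
```

The idea is to parse the text dictionaries once, call store, and have the program call load at startup. Reserving the buckets up front is what saves most of the time compared with inserting words one by one into an empty map.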

Dov answered Oct 06 '22 16:10



I have found Google's CLD very helpful. It's written in C++, and, from their web site:

"CLD (Compact Language Detector) is the library embedded in Google's Chromium browser. The library detects the language from provided UTF8 text (plain text or HTML). It's implemented in C++, with very basic Python bindings."

mrz answered Oct 06 '22 17:10
