Word language detection in C++

2 Answers

Simple language recognition from words is easy. You don't need to understand the semantics of the text, and you don't need any computationally expensive algorithms, just a fast hash map. The problem is that you need a lot of data. Fortunately, you can probably find dictionaries of words in each language you care about. Define a bit mask with one bit per language; that will allow you to mark words like "the" as recognized in multiple languages. Then read each language dictionary into your hash map. If the word is already present from a different language, just set the current language's bit as well.
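A minimal sketch of that dictionary, assuming `std::unordered_map` as the hash map and one bit per language. The names `Lang`, `LangDict`, `addLanguage` and `lookup` are made up for illustration, and the word lists are assumed to be already read from whatever dictionary files you have:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical language bits: one bit per language.
enum Lang : std::uint32_t {
    ENGLISH = 1u << 0,
    FRENCH  = 1u << 1,
    GERMAN  = 1u << 2,
};

using LangDict = std::unordered_map<std::string, std::uint32_t>;

// Merge one language's word list into the shared map.
// operator[] value-initializes a missing entry to 0, so OR-ing the bit
// works both for new words and for words already added by another language.
void addLanguage(LangDict& dict, const std::vector<std::string>& words,
                 std::uint32_t lang) {
    for (const std::string& w : words) {
        dict[w] |= lang;
    }
}

// Look up a word without inserting it; 0 means "not in any dictionary".
std::uint32_t lookup(const LangDict& dict, const std::string& word) {
    auto it = dict.find(word);
    return it == dict.end() ? 0u : it->second;
}
```

Calling `addLanguage` once with `ENGLISH` and once with `FRENCH` for lists that both contain "commercial" leaves that word mapped to `ENGLISH | FRENCH`. Using `find` in `lookup` avoids growing the map with every unknown word you test.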

Suppose a given word is in both English and French. Then when you look it up (for example, "commercial"), it will map to ENGLISH|FRENCH. Suppose ENGLISH = 1, FRENCH = 2, and so on; you'll find the value 3. If you want to know whether a word is in your language only, you would test:

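int langs = dict["the"];   // note: operator[] inserts a 0 entry if the word is missing
// == binds tighter than |, so the parentheses are required
if ((langs | mylang) == mylang) {
   // no other language (also true when langs == 0, i.e. the word is unknown)
}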

Since there will be other languages, a more general approach is probably better. For each bit set in the word's mask, add 1 to the count for the corresponding language. Do this for n words. After about n = 10 words of a typical text, you'll have 10 for the dominant language and maybe 2 for a language that it is related to (like English/French), and you can determine with high probability that the text is English. Remember that even a text written in one language can still contain a quote in another, so the mere presence of a foreign word doesn't mean the document is in that language. Pick a threshold; it will work quite well (and very, very fast).
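The counting scheme could look roughly like this. It is only a sketch: `kNumLangs`, `detectLanguage` and the `minVotes` threshold are hypothetical names, and it assumes bit 0 is English, bit 1 is French and bit 2 is German:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assumption: three languages, bit 0 = English, bit 1 = French, bit 2 = German.
constexpr std::size_t kNumLangs = 3;

using LangDict = std::unordered_map<std::string, std::uint32_t>;

// Returns the bit index of the language with the most votes, or -1 if
// even the best language has fewer than minVotes matching words.
int detectLanguage(const LangDict& dict,
                   const std::vector<std::string>& words, int minVotes) {
    std::array<int, kNumLangs> votes{};  // all counters start at 0
    for (const std::string& w : words) {
        auto it = dict.find(w);
        if (it == dict.end()) continue;  // unknown word: no vote
        for (std::size_t bit = 0; bit < kNumLangs; ++bit) {
            if (it->second & (std::uint32_t{1} << bit)) ++votes[bit];
        }
    }
    std::size_t best = 0;
    for (std::size_t i = 1; i < kNumLangs; ++i) {
        if (votes[i] > votes[best]) best = i;  // ties keep the lower index
    }
    return votes[best] >= minVotes ? static_cast<int>(best) : -1;
}
```

A word shared by two languages votes for both, so a stray foreign quote only nudges the runner-up. You may also want to require a margin over the second-best language, which this sketch does not do.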

Obviously the hardest thing about this is reading in all the dictionaries. This isn't a code problem, it's a data collection problem. Fortunately, that's your problem, not mine.

To make this fast, you will need to preload the hash map; otherwise loading it up initially is going to hurt. If that's an issue, you will have to write store and load methods for the hash map that load the entire thing efficiently as a block.

answered Oct 06 '22 16:10

Dov

I have found Google's CLD very helpful. It's written in C++, and from their web site:

"CLD (Compact Language Detector) is the library embedded in Google's Chromium browser. The library detects the language from provided UTF8 text (plain text or HTML). It's implemented in C++, with very basic Python bindings."

answered Oct 06 '22 17:10

mrz


Tags:

c++

Vivek Kumar
