I'm creating an application that detects the language of short texts (e.g. tweets, user queries, SMS messages) that average fewer than 100 characters and contain slang.
All of the libraries I've tested work well for normal web pages but not for very short texts. The one giving the best results so far is Chrome's Compact Language Detector (CLD), which I had to build as a shared library.
CLD fails when the text consists of very short words. After looking at CLD's source code, I see that it uses character 4-grams, which could be the reason: very short words yield few or no 4-grams, as the snippet below shows.
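A quick way to see this with NLTK (the token "lol" is just an illustrative example):

    from nltk import ngrams

    # A 3-character token produces zero 4-grams, one trigram, and two
    # bigrams, so lower-order models get strictly more evidence out of
    # slangy short words than a 4-gram model does.
    for n in (4, 3, 2):
        print(n, ["".join(g) for g in ngrams("lol", n)])
    # 4 []
    # 3 ['lol']
    # 2 ['lo', 'ol']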
The approach I'm thinking of right now to improve the accuracy is:
Which dataset is most suitable for this task, and how can I improve this approach?
So far I'm using EUROPARL and Wikipedia articles as training data, and NLTK for most of the work.
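Here is a minimal sketch of the kind of classifier I'm experimenting with: a Naive Bayes model over character trigrams, trained on the Europarl samples that ship with NLTK (after nltk.download('europarl_raw')). The chunk size, trigram order, and three-language setup are arbitrary starting points, not tuned values:

    import nltk
    from nltk.corpus.europarl_raw import english, french, german

    def char_ngram_features(text, n=3):
        # Bag of character trigrams; padding with spaces marks word
        # boundaries and guarantees even 1-2 character tokens yield features.
        padded = " %s " % text.lower().strip()
        return {"".join(g): True for g in nltk.ngrams(padded, n)}

    def labeled_chunks(corpus, label, size=80, n_chunks=500):
        # Slice the raw corpus into short, tweet-sized snippets so the
        # training distribution resembles the texts I want to classify.
        raw = corpus.raw()
        return [(char_ngram_features(raw[i:i + size]), label)
                for i in range(0, min(len(raw), size * n_chunks), size)]

    train_set = (labeled_chunks(english, "en")
                 + labeled_chunks(french, "fr")
                 + labeled_chunks(german, "de"))
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Slangy test strings; these should come out as "en" and "fr".
    print(classifier.classify(char_ngram_features("c u l8r at the mall")))
    print(classifier.classify(char_ngram_features("mdr trop drôle ce mec")))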
Language detection for very short texts is a topic of current research, so no conclusive answer can be given. An algorithm for Twitter data can be found in Carter, Tsagkias, and Weerkamp (2011); see also the references cited there.