Language detection for very short text [closed]

I'm creating an application for detecting the language of short texts, averaging fewer than 100 characters and containing slang (e.g. tweets, user queries, SMS messages).

All the libraries I tested work well for normal web pages but not for very short texts. The library giving the best results so far is Chromium's Compact Language Detector (CLD), which I had to build as a shared library.
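
For reference, CLD can also be called from Python through community bindings; a minimal sketch assuming the pycld2 package (my choice of binding, not something from the question):

```python
import pycld2 as cld2  # community Python binding around Chromium's CLD2

# detect() returns a reliability flag, the number of text bytes analysed,
# and up to three (language name, language code, percent, score) tuples.
is_reliable, bytes_found, details = cld2.detect("Bonjour tout le monde")
print(is_reliable, details[0])  # e.g. True ('FRENCH', 'fr', 99, ...)
```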

CLD fails when the text is made up of very short words. After looking at the source code of CLD, I see that it scores character 4-grams, which could be the reason: a short text simply doesn't produce enough 4-grams to score reliably.
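
To illustrate the problem (my own toy example, not code from CLD):

```python
def char_ngrams(text, n=4):
    """Character n-grams of a text — the unit an n-gram detector scores."""
    text = " " + text.strip() + " "   # pad so word boundaries show up in n-grams
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A short, slangy message yields very little evidence to score:
print(len(char_ngrams("lol ok")))                                       # 5
print(len(char_ngrams("The quick brown fox jumps over the lazy dog")))  # 42
```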

The approach I'm thinking of right now to improve the accuracy is:

  • Remove brand names, numbers, URLs, and language-neutral words such as "software", "download", and "internet"
  • Fall back to a dictionary when the proportion of short words in the text exceeds a threshold, or when the text contains too few words (see the sketch after this list)
  • Build the dictionary from Wikipedia news articles plus Hunspell dictionaries
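
A minimal sketch of that cleanup + dictionary fallback (the noise list, threshold, and function names are placeholders of mine, not values from the question):

```python
import re

NOISE_WORDS = {"software", "download", "internet"}   # hypothetical noise list
URL_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"\b\d+\b")

def clean(text):
    """Strip URLs, numbers and language-neutral words before detection."""
    text = URL_RE.sub(" ", text)
    text = NUM_RE.sub(" ", text)
    return [w for w in text.split() if w.lower() not in NOISE_WORDS]

def detect(text, dictionaries, ngram_detect, short_len=4, threshold=0.5):
    """Use dictionary lookup when the text is dominated by short words.

    `dictionaries` maps a language code to a set of word forms (e.g. built
    from Wikipedia + Hunspell); `ngram_detect` is any n-gram detector (CLD).
    """
    words = clean(text)
    short = sum(1 for w in words if len(w) <= short_len)
    if not words or short / len(words) > threshold:
        # Score each language by its dictionary hit rate on the words.
        scores = {lang: sum(w.lower() in vocab for w in words)
                  for lang, vocab in dictionaries.items()}
        return max(scores, key=scores.get)
    return ngram_detect(" ".join(words))
```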

What dataset is most suitable for this task? And how can I improve this approach?

So far I'm using Europarl and Wikipedia articles, with NLTK for most of the work.
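
For example, per-language character n-gram profiles can be built straight from NLTK's copy of Europarl (a sketch; it assumes the corpus has been fetched with nltk.download('europarl_raw')):

```python
from collections import Counter
from nltk.corpus.europarl_raw import english, french  # other languages available too

def profile(corpus_words, n=3, top=300):
    """Top-k character n-gram frequency profile (Cavnar-Trenkle style)."""
    counts = Counter()
    for word in corpus_words:
        padded = f" {word.lower()} "
        counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [g for g, _ in counts.most_common(top)]

profiles = {"en": profile(english.words()), "fr": profile(french.words())}
```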

asked by MrD

1 Answer

Language detection for very short texts is a topic of ongoing research, so no conclusive answer can be given. An algorithm for Twitter data can be found in Carter, Tsagkias & Weerkamp (2011); see also the references cited there.

answered by Fred Foo