
Streaming API with languages

Tags:

twitter

Is there any way I can retrieve only English tweets using Twitter's live Streaming API? Using "sample" or "filter" seems to return around 60-70 percent non-English tweets.

Thanks

Joel

Joel asked Sep 22 '10 07:09



2 Answers

I haven't found a good solution to this, but I've worked around it as follows:

1) Filter by the lang attribute equal to "en".

2) I found that tweets in several other languages still show up among the tweets labelled English. So I downloaded Spanish, Dutch, and Indonesian word lists and counted the non-English words in each tweet. If a tweet contains more than one, I discard it as non-English.

3) I think I need to filter for Portuguese as well; I still need to investigate this.
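
The steps above could be sketched roughly like this in Python. This is only an illustration, not the exact code used: it assumes each tweet is a dict with "lang" and "text" keys, and the names `NON_ENGLISH_WORDS` and `looks_english` are made up for the example. The tiny word set stands in for the full Spanish, Dutch, and Indonesian lists.

```python
import re

# Tiny stand-in word set (assumption); in practice load full
# Spanish, Dutch and Indonesian word lists from files.
NON_ENGLISH_WORDS = {"porque", "gracias", "niet", "maar", "tidak", "yang"}


def looks_english(tweet, non_english_words=NON_ENGLISH_WORDS, max_hits=1):
    """Return True if the tweet passes both filtering steps.

    `tweet` is assumed to be a dict with "lang" and "text" keys.
    """
    # Step 1: keep only tweets whose lang attribute is "en".
    if tweet.get("lang") != "en":
        return False

    # Step 2: count tokens that appear in the non-English word lists.
    tokens = re.findall(r"[^\W\d_]+", tweet.get("text", "").lower())
    hits = sum(1 for token in tokens if token in non_english_words)

    # More than `max_hits` non-English words: discard the tweet.
    return hits <= max_hits
```

A threshold of one hit tolerates a single borrowed word in an otherwise English tweet. Word lists that share entries with English would need those entries removed first, or the filter will reject valid tweets.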

Drew answered Oct 23 '22 07:10



Filtering only English-language messages from the Twitter stream is an active research area. You could use an off-the-shelf language identification system to process the stream locally and select only messages in English. One such system is langid.py. Full disclosure: I am the author of langid.py.
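
As a rough sketch of what local filtering might look like: langid.py exposes `langid.classify(text)`, which returns a `(language_code, score)` pair. The helper name `english_only` and the injectable `classify` parameter are assumptions for this example; the parameter just makes it easy to swap in another identifier.

```python
def english_only(texts, classify=None):
    """Return the texts that the classifier labels as English.

    `classify` is any callable returning (language_code, score).
    When omitted, langid.classify is used (requires `pip install langid`).
    """
    if classify is None:
        import langid  # third-party package, imported lazily
        classify = langid.classify

    return [text for text in texts if classify(text)[0] == "en"]
```

Calling `english_only(tweet_texts)` with langid installed filters a batch of tweet texts. Short, informal messages are hard to classify, so some errors are to be expected.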

Another system I know of is ldig by Nakatani Shuyo. I haven't had a chance to experiment with it yet, but it is made specifically for language identification of Twitter messages.

saffsd answered Oct 23 '22 07:10
