Is there any way I can retrieve only English tweets using Twitter's live Streaming API? It seems that using "sample" or "filter" returns around 60-70 percent non-English tweets.
Thanks
Joel
I haven't found a good solution to this, but I've worked around it as follows:
1) Filter by the lang attribute equal to "en".
2) I found that tweets in several non-English languages still show up among those labelled English. So I downloaded Spanish, Dutch, and Indonesian word lists and counted the words from those lists that occur in each tweet. If a tweet contains more than one, I discard it as non-English.
3) I think I need to filter for Portuguese as well, but I still need to investigate this.
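The steps above could be sketched roughly like this. It is only an illustration: the tiny word sets stand in for the downloaded word lists, the tweet is assumed to be a dict with top-level "lang" and "text" keys, and the names `looks_english` and `max_hits` are made up for the example.

```python
import re

# Placeholder word lists; real ones would be loaded from the downloaded files.
NON_ENGLISH_WORDS = {
    "es": {"que", "pero", "porque", "gracias", "tambien"},
    "nl": {"het", "een", "niet", "maar", "voor"},
    "id": {"yang", "tidak", "saya", "kamu", "dengan"},
}
ALL_NON_ENGLISH = set().union(*NON_ENGLISH_WORDS.values())

# Runs of letters only (no digits, underscores, or punctuation).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def looks_english(tweet, max_hits=1):
    """Return True if the tweet is labelled "en" and contains at most
    `max_hits` words from the non-English word lists."""
    # Step 1: trust the lang attribute as a first-pass filter.
    if tweet.get("lang") != "en":
        return False
    # Step 2: count words that appear in any non-English list.
    tokens = TOKEN_RE.findall(tweet.get("text", "").lower())
    hits = sum(1 for token in tokens if token in ALL_NON_ENGLISH)
    # More than `max_hits` non-English words: discard.
    return hits <= max_hits
```

Word lists should probably be pruned of words that are also common in English, otherwise genuine English tweets get discarded.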
Filtering only English-language messages from the Twitter stream is an active research area. You could use an off-the-shelf language identification system to process the stream locally and select only messages in English. One such system is langid.py. Full disclosure: I am the author of langid.py.
Another system I know of is ldig by Nakatani Shuyo. I haven't had a chance to experiment with it yet, but it is made specifically for language identification of Twitter messages.
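For the local-filtering approach with langid.py, a minimal sketch might look like the following. It relies on `langid.classify(text)` returning a `(language_code, score)` pair; the helper name `english_only` and its `classify` parameter are made up for this example, and the classifier is pluggable so that something like ldig could be swapped in if wrapped to return the same kind of pair.

```python
def english_only(texts, classify=None, language="en"):
    """Yield only the texts that the classifier labels as `language`.

    `classify` is any callable returning a (language_code, score) pair.
    If it is omitted, langid.classify is used (requires `pip install langid`).
    """
    if classify is None:
        import langid  # imported lazily so the helper works with other classifiers
        classify = langid.classify
    for text in texts:
        lang, _score = classify(text)
        if lang == language:
            yield text
```

Usage would be along the lines of `for text in english_only(tweet_texts): ...`, where `tweet_texts` is whatever iterable of message strings your stream client produces. Short messages are hard to classify, so expect some errors either way.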