Filtering of tweets received from statuses/filter (streaming API)

I have N different keywords that I am tracking (for the sake of simplicity, let N=3). So in GET statuses/filter, I will give 3 keywords in the "track" argument.

Now the tweets that I receive can match ANY of the 3 keywords that I mentioned. The problem is that I want to work out which tweet corresponds to which keyword, i.e. a mapping between tweets and the keyword(s) given in the "track" argument.

Apparently, there is no way to do this without processing the tweets received.

So I was wondering: what is the best way to do this processing? Search for the keywords in the text of the tweet? What about case-insensitivity? What about a keyword made up of multiple words, e.g. "Katrina Kaif"?

I am currently trying to formulate some regular expression...

I was thinking the BEST way would be to use the same logic (regular expressions etc.) that the statuses/filter API itself uses. How can I find out what logic the Twitter statuses/filter API uses to match tweets to the keywords?

Advice? Help?

P.S.: I am using Python, Tweepy, regex, MongoDB/Apache S4 (for distributed computing)

user1599964 asked May 17 '13 06:05



1 Answer

The first thing that comes to mind is to create a separate stream for every keyword and start each one in its own thread, like this:

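from threading import Thread
import tweepy  # assumes tweepy 3.x, where tweepy.StreamListener exists


class StreamListener(tweepy.StreamListener):
    def __init__(self, keyword, api=None):
        super(StreamListener, self).__init__(api)
        # Remember which keyword this stream tracks, so every tweet
        # arriving on it can be attributed to that keyword.
        self.keyword = keyword

    def on_status(self, tweet):
        # Not reached here: the on_data override below replaces the
        # default dispatch that would otherwise call on_status.
        print('Ran on_status')

    def on_error(self, status_code):
        print('Error: ' + repr(status_code))
        return False  # returning False stops the stream

    def on_data(self, data):
        # data is the raw JSON string of the message
        print(self.keyword, data)
        print('Ok, this is actually running')


def start_stream(auth, track):
    # filter() blocks, which is why each stream gets its own thread
    tweepy.Stream(auth=auth, listener=StreamListener(track)).filter(track=[track])


# Placeholder credentials: replace with your own keys
auth = tweepy.OAuthHandler('<consumer_key>', '<consumer_secret>')
auth.set_access_token('<key>', '<secret>')

track = ['obama', 'cats', 'python']
for item in track:
    thread = Thread(target=start_stream, args=(auth, item))
    thread.start()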

If you still want to map tweets to keywords yourself within a single stream, see Twitter's documentation on how the track request parameter is matched. There are some edge cases that could cause problems.
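For the single-stream route, here is a minimal sketch of the matching step. It assumes the track semantics as I understand them from Twitter's documentation: commas act as OR, spaces inside a phrase act as AND, and matching ignores case and word order. The helper names `tokenize` and `matched_keywords` are made up for illustration, and the sketch only looks at the tweet text. Twitter also matches against some entity fields (expanded URLs, hashtags, mentions), so treat this as an approximation rather than Twitter's actual algorithm.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens.

    Punctuation, '#' and '@' are dropped, so '#cats' yields 'cats'.
    """
    return set(re.findall(r"\w+", text.lower()))


def matched_keywords(text, track):
    """Return the phrases from track whose terms all occur in text.

    A phrase matches when every one of its terms is present as a whole
    token, in any order and regardless of case (assumed track semantics).
    """
    tokens = tokenize(text)
    matches = []
    for phrase in track:
        terms = tokenize(phrase)
        # Skip empty phrases; an empty set would be a subset of anything.
        if terms and terms <= tokens:
            matches.append(phrase)
    return matches


track = ["obama", "cats", "Katrina Kaif"]
print(matched_keywords("Kaif and KATRINA love #cats", track))
```

Inside a listener you could call `matched_keywords(tweet.text, track)` from `on_status` and store the result next to the tweet. Because whole tokens are compared, "pythonic" does not match the keyword "python", which is one of the edge cases worth checking against your own data.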

Hope that helps.

alecxe answered Sep 18 '22 01:09