Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I save streaming tweets in json via tweepy?

I've been learning Python for a couple of months through online courses and would like to further my learning through a real world mini project.

For this project, I would like to collect tweets from the twitter streaming API and store them in json format (though you can choose to just save the key information like status.text, status.id, I've been advised that the best way to do this is to save all the data and do the processing after). However, with the addition of the on_data() the code ceases to work. Would someone be able to to assist please? I'm also open to suggestions on the best way to store/process tweets! My end goal is to be able to track tweets based on demographic variables (e.g., country, user profile age, etc) and the sentiment of particular brands (e.g., Apple, HTC, Samsung).

In addition, I would also like to try filtering tweets by location AND keywords. I've adapted the code from How to add a location filter to tweepy module separately. However, while it works when there are a few keywords, it stops when the number of keywords grows. I presume my code is inefficient. Is there a better way of doing it?

### code to save tweets in json###
import sys
import tweepy
import json

consumer_key=" "
consumer_secret=" "
access_key = " "
access_secret = " "

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
file = open('today.txt', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print status.text

    def on_data(self, data):
        json_data = json.loads(data)
        file.write(str(json_data))

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True # Don't kill the stream

sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(track=['twitter'])
like image 283
Eugene Yan Avatar asked May 08 '14 02:05

Eugene Yan


People also ask

How does Twitter use JSON?

All Twitter APIs that return Tweets provide that data encoded using JavaScript Object Notation (JSON). JSON is based on key-value pairs, with named attributes and associated values. These attributes, and their state are used to describe objects. At Twitter we serve many objects as JSON, including Tweets and Users.


3 Answers

I found a way to save the tweets to a json file. Happy to hear how it can be improved!

# initialize blank list to contain tweets
tweets = []
# file name that you want to open is the second argument
save_file = open('9may.json', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(tweepy.StreamListener, self).__init__()

        self.save_file = tweets

    def on_data(self, tweet):
        self.save_file.append(json.loads(tweet))
        print tweet
        save_file.write(str(tweet))
like image 83
Eugene Yan Avatar answered Sep 21 '22 05:09

Eugene Yan


In rereading your original question, I realize that you ask a lot of smaller questions. I'll try to answer most of them here but some may merit actually asking a separate question on SO.

  • Why does it break with the addition of on_data ?

Without seeing the actual error, it's hard to say. It actually didn't work for me until I regenerated my consumer/acess keys, I'd try that.

There are a few things I might do differently than your answer.

tweets is a global list. This means that if you have multiple StreamListeners (i.e. in multiple threads), every tweet collected by any stream listener will be added to this list. This is because lists in Python refer to locations in memory--if that's confusing, here's a basic example of what I mean:

>>> bar = []
>>> foo = bar
>>> foo.append(7)
>>> print bar
[7]

Notice that even though you thought appended 7 to foo, foo and bar actually refer to the same thing (and therefore changing one changes both).

If you meant to do this, it's a pretty great solution. However, if your intention was to segregate tweets from different listeners, it could be a huge headache. I personally would construct my class like this:

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(tweepy.StreamListener, self).__init__()

        self.list_of_tweets = []

This changes the tweets list to be only in the scope of your class. Also, I think it's appropriate to change the property name from self.save_file to self.list_of_tweets because you also name the file that you're appending the tweets to save_file. Although this will not strictly cause an error, it's confusing to human me that self.save_file is a list and save_file is a file. It helps future you and anyone else that reads your code figure out what the heck everything does/is. More on variable naming.

In my comment, I mentioned that you shouldn't use file as a variable name. file is a Python builtin function that constructs a new object of type file. You can technically overwrite it, but it is a very bad idea to do so. For more builtins, see the Python documentation.

  • How do I filter results on multiple keywords?

All keywords are OR'd together in this type of search, source:

sapi.filter(track=['twitter', 'python', 'tweepy'])

This means that this will get tweets containing 'twitter', 'python' or 'tweepy'. If you want the union (AND) all of the terms, you have to post-process by checking a tweet against the list of all terms you want to search for.

  • How do I filter results based on location AND keyword?

I actually just realized that you did ask this as its own question as I was about to suggest. A regex post-processing solution is a good way to accomplish this. You could also try filtering by both location and keyword like so:

sapi.filter(locations=[103.60998,1.25752,104.03295,1.44973], track=['twitter'])
  • What is the best way to store/process tweets?

That depends on how many you'll be collecting. I'm a fan of databases, especially if you're planning to do a sentiment analysis on a lot of tweets. When you collect data, you should only collect things you will need. This means, when you save results to your database/wherever in your on_data method, you should extract the important parts from the JSON and not save anything else. If for example you want to look at brand, country and time, only take those three things; don't save the entire JSON dump of the tweet because it'll just take up unnecessary space.

like image 45
Luigi Avatar answered Sep 22 '22 05:09

Luigi


I just insert the raw JSON into the database. It seems a bit ugly and hacky but it does work. A noteable problem is that the creation dates of the Tweets are stored as strings. How do I compare dates from Twitter data stored in MongoDB via PyMongo? provides a way to fix that (I inserted a comment in the code to indicate where one would perform that task)

# ...

client = pymongo.MongoClient()
db = client.twitter_db
twitter_collection = db.tweets

# ...

class CustomStreamListener(tweepy.StreamListener):
    # ...
    def on_status(self, status):
            try:
                twitter_json = status._json
                # TODO: Transform created_at to Date objects before insertion
                tweet_id = twitter_collection.insert(twitter_json)
            except:
                # Catch any unicode errors while printing to console
                # and just ignore them to avoid breaking application.
                pass
    # ...

stream = tweepy.Stream(auth, CustomStreamListener(), timeout=None, compression=True)
stream.sample()
like image 36
Kristian Rother Avatar answered Sep 22 '22 05:09

Kristian Rother