I've been learning Python for a couple of months through online courses and would like to further my learning through a real-world mini-project.
For this project, I would like to collect tweets from the Twitter Streaming API and store them in JSON format (you could save only key fields like status.text and status.id, but I've been advised that it's best to save all the data and do the processing afterwards). However, once I add the on_data() method, the code stops working. Could someone assist, please? I'm also open to suggestions on the best way to store/process tweets! My end goal is to be able to track tweets by demographic variables (e.g. country, user profile age, etc.) and by sentiment towards particular brands (e.g. Apple, HTC, Samsung).
In addition, I would also like to try filtering tweets by location AND keywords. I've adapted the code from How to add a location filter to tweepy module to filter by each separately. However, while it works when there are a few keywords, it stops working when the number of keywords grows. I presume my code is inefficient. Is there a better way of doing it?
### Code to save tweets in JSON ###
import sys
import tweepy
import json

consumer_key = " "
consumer_secret = " "
access_key = " "
access_secret = " "

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

file = open('today.txt', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print status.text

    def on_data(self, data):
        json_data = json.loads(data)
        file.write(str(json_data))

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True  # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True  # Don't kill the stream

sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(track=['twitter'])
All Twitter APIs that return Tweets provide that data encoded using JavaScript Object Notation (JSON). JSON is based on key-value pairs, with named attributes and associated values. These attributes, and their state, are used to describe objects. At Twitter we serve many objects as JSON, including Tweets and Users.
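To make that concrete, here's a heavily truncated sketch of what one of those Tweet objects looks like once decoded into a Python dict (the field names are real, the values are invented, and the full object carries many more fields):

tweet = {
    "created_at": "Thu May 09 18:36:02 +0000 2013",
    "id_str": "332500000000000000",  # invented example id
    "text": "example tweet text",
    "user": {
        "screen_name": "example_user",
        "location": "Singapore",
    },
}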
I found a way to save the tweets to a JSON file. Happy to hear how it can be improved!
import json
import tweepy

# initialize blank list to contain tweets
tweets = []
# file name that you want to open is the second argument
save_file = open('9may.json', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        super(CustomStreamListener, self).__init__()
        self.api = api
        self.save_file = tweets

    def on_data(self, tweet):
        self.save_file.append(json.loads(tweet))
        print tweet
        save_file.write(str(tweet))
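For completeness, the listener above still needs to be attached to a stream and started; a minimal sketch, reusing the auth and api objects set up in the question:

# attach the listener to a stream and start filtering
# (auth and api are the same objects created in the question's code)
stream = tweepy.Stream(auth, CustomStreamListener(api))
stream.filter(track=['twitter'])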
In rereading your original question, I realize that you ask a lot of smaller questions. I'll try to answer most of them here but some may merit actually asking a separate question on SO.
Why doesn't on_data() work?
Without seeing the actual error, it's hard to say. It actually didn't work for me until I regenerated my consumer/access keys; I'd try that.
There are a few things I would do differently from your answer.
tweets is a global list. This means that if you have multiple StreamListeners (i.e. in multiple threads), every tweet collected by any stream listener will be added to this list. This is because lists in Python refer to locations in memory--if that's confusing, here's a basic example of what I mean:
>>> bar = []
>>> foo = bar
>>> foo.append(7)
>>> print bar
[7]
Notice that even though you only appended 7 to foo, foo and bar actually refer to the same list (and therefore changing one changes both).
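If you do want two independent lists, copy explicitly instead of assigning a second name:

>>> bar = []
>>> foo = list(bar)  # or bar[:] -- a shallow copy, not another name for bar
>>> foo.append(7)
>>> print foo
[7]
>>> print bar
[]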
If you meant to do this, it's a pretty great solution. However, if your intention was to segregate tweets from different listeners, it could be a huge headache. I personally would construct my class like this:
class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        super(CustomStreamListener, self).__init__()
        self.api = api
        self.list_of_tweets = []
This changes the tweets list to be only in the scope of your class. Also, I think it's appropriate to change the property name from self.save_file to self.list_of_tweets, because you also name the file that you're appending the tweets to save_file. Although this will not strictly cause an error, it's confusing (to human me, at least) that self.save_file is a list and save_file is a file. It helps future you, and anyone else who reads your code, figure out what the heck everything does/is. More on variable naming.
In my comment, I mentioned that you shouldn't use file as a variable name. file is a Python builtin function that constructs a new object of type file. You can technically overwrite it, but it is a very bad idea to do so. For more builtins, see the Python documentation.
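A quick interactive sketch of how that shadowing bites you in Python 2:

>>> file = open('today.txt', 'a')  # 'file' now names your file object...
>>> f = file('other.txt')          # ...so the builtin is no longer callable
Traceback (most recent call last):
  ...
TypeError: 'file' object is not callable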
All keywords are OR'd together in this type of search (source):
sapi.filter(track=['twitter', 'python', 'tweepy'])
This means that this will get tweets containing 'twitter', 'python', or 'tweepy'. If you want the intersection (AND) of all the terms, you have to post-process by checking each tweet against the list of all terms you want to search for.
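A minimal sketch of that post-processing, assuming you keep your list of required terms around (the terms here are just placeholders):

required_terms = ['apple', 'iphone']  # placeholder terms

def contains_all_terms(text, terms):
    # keep a tweet only if every term appears (case-insensitively) in its text
    text = text.lower()
    return all(term in text for term in terms)

# e.g. inside on_status:
#     if contains_all_terms(status.text, required_terms):
#         ...handle the tweet...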
I actually just realized that you did ask this as its own question, as I was about to suggest. A regex post-processing solution is a good way to accomplish this. You could also try filtering by both location and keyword like so:
sapi.filter(locations=[103.60998,1.25752,104.03295,1.44973], track=['twitter'])
That depends on how many you'll be collecting. I'm a fan of databases, especially if you're planning to do sentiment analysis on a lot of tweets. When you collect data, you should only collect things you will need. This means that when you save results to your database (or wherever) in your on_data method, you should extract the important parts from the JSON and not save anything else. If, for example, you want to look at brand, country, and time, only take those three things; don't save the entire JSON dump of the tweet, because it'll just take up unnecessary space.
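As a sketch of that, assuming data is the raw JSON string handed to on_data, and picking fields that match the question's brand/country/profile-age goals (which fields you keep is entirely up to you):

import json

def extract_fields(data):
    # pull out only what the analysis needs instead of the whole dump
    tweet = json.loads(data)
    return {
        'text': tweet.get('text'),
        'created_at': tweet.get('created_at'),
        'country': (tweet.get('place') or {}).get('country'),
        'user_created_at': tweet.get('user', {}).get('created_at'),
    }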
I just insert the raw JSON into the database. It seems a bit ugly and hacky, but it does work. A notable problem is that the creation dates of the Tweets are stored as strings. How do I compare dates from Twitter data stored in MongoDB via PyMongo? provides a way to fix that (I inserted a comment in the code to indicate where one would perform that task).
# ...
client = pymongo.MongoClient()
db = client.twitter_db
twitter_collection = db.tweets
# ...

class CustomStreamListener(tweepy.StreamListener):
    # ...
    def on_status(self, status):
        try:
            twitter_json = status._json
            # TODO: Transform created_at to Date objects before insertion
            tweet_id = twitter_collection.insert(twitter_json)
        except:
            # Catch any unicode errors while printing to console
            # and just ignore them to avoid breaking application.
            pass
    # ...

stream = tweepy.Stream(auth, CustomStreamListener(), timeout=None, compression=True)
stream.sample()
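For the TODO above, one way to turn created_at into a real date before inserting (the format string matches Twitter's fixed created_at layout, e.g. 'Thu May 09 18:36:02 +0000 2013'):

from datetime import datetime

def parse_created_at(created_at):
    # 'Thu May 09 18:36:02 +0000 2013' -> datetime(2013, 5, 9, 18, 36, 2)
    return datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')

# before the insert:
#     twitter_json['created_at'] = parse_created_at(twitter_json['created_at'])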