
Get all Twitter mentions using tweepy for users with millions of followers

I have a project in mind: download all the tweets sent to celebrities over the last year, run a sentiment analysis on them, and evaluate who has the most positive fans.

Then I discovered that the tweepy/Twitter API returns mentions for at most the last 7 days. I searched the net but couldn't find any way to download tweets for the last year.

Anyway, I decided to do the project on the last 7 days of data only and wrote the following code:

try:
    while 1:
        for results in tweepy.Cursor(twitter_api.search, q="@celebrity_handle").items(9999999):
            item = (results.text).encode('utf-8').strip()
            wr.writerow([item, results.created_at])  # write to a csv (tweet, date)
except tweepy.TweepError as e:
    print("Search stopped: " + str(e))

I am using the Cursor search API because the other way to get mentions (the more accurate one) is limited to the last 800 tweets.

Anyway, after running the code overnight, I was able to download only 32K tweets. Around 90% of them were retweets.

Is there a better, more efficient way to get mentions data?

Do keep in mind that:

  1. I want to do this for multiple celebrities. (Famous ones with millions of followers).
  2. I don't care about retweets.
  3. They have thousands of tweets sent to them per day.

Any suggestions would be welcome; at the moment I am out of ideas.

Asked by Piyush, Dec 11 '22 15:12


1 Answer

I would use the search API. I did something similar with the following code, and it appears to have worked exactly as expected. I used it on a specific movie star and pulled 15568 tweets; on a quick scan, all of them appear to be @mentions of that person. (I pulled from their entire timeline.)

In your case, for a search you'd want to run, say, daily, I'd store the ID of the most recent mention you pulled for each user and set that value as sinceId each time you rerun the search.
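One way to keep that per-user ID between runs is a small JSON file. This is only a sketch: the file layout and the helper names (load_since_ids, record_newest_id, save_since_ids) are made up for illustration and are not part of tweepy. It relies on tweet IDs increasing over time, so the largest ID in a batch is the newest tweet.

```python
import json
import os


def load_since_ids(path):
    """Return the {username: last_seen_tweet_id} mapping, or {} on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def record_newest_id(since_ids, username, tweet_ids):
    """Remember the largest tweet ID seen so far for username."""
    if tweet_ids:
        newest = max(tweet_ids)
        since_ids[username] = max(newest, since_ids.get(username, 0))
    return since_ids


def save_since_ids(path, since_ids):
    """Write the mapping back so the next run can resume from it."""
    with open(path, "w") as f:
        json.dump(since_ids, f)
```

On each daily run you would load the mapping, pass since_ids.get(username) as sinceId for that user's search, call record_newest_id with the IDs you downloaded, and save the mapping again.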

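As an aside, AppAuthHandler lets you pull much faster than OAuthHandler, because application-only auth gets a higher search rate limit (450 rather than 180 search requests per 15-minute window on the standard v1.1 API), and you won't need user authentication for these kinds of data pulls.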

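import tweepy
import jsonpickle  # used further down to serialize each tweet
auth = tweepy.AppAuthHandler(consumer_token, consumer_secret)
auth.secure = True
# wait_on_rate_limit makes tweepy sleep until the rate-limit window resets
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)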

searchQuery = '@username'  # this is what we're searching for

In your case I would make a list and iterate through all of the usernames in each pass of the search query run.

retweet_filter = ' -filter:retweets'  # this filters out retweets (note the leading space)

Inside each api.search call below, I would pass the following as the query parameter:

q=searchQuery + retweet_filter
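For several celebrities, the query strings could be built up front. A minimal sketch, assuming the handles below are placeholders and build_mentions_query is a hypothetical helper (not a tweepy function); the one thing to get right is that the mention and the -filter:retweets operator are separated by a space.

```python
def build_mentions_query(username, exclude_retweets=True):
    """Build a search query for tweets that mention username."""
    # Accept the handle with or without a leading "@".
    terms = ["@" + username.lstrip("@")]
    if exclude_retweets:
        terms.append("-filter:retweets")
    # Search operators must be separated by spaces.
    return " ".join(terms)


usernames = ["celebrity_one", "@celebrity_two"]  # hypothetical handles
queries = {name: build_mentions_query(name) for name in usernames}
```

Each value in queries can then be passed as q to the search call for that user.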

The following code (and the API setup above) is from this link:

tweetsPerQry = 100  # this is the max the API permits

fName = 'tweets.txt'  # We'll store the tweets in a text file.

# If results from a specific ID onwards are required, set sinceId to that ID.
# Else default to no lower limit, go as far back as the API allows.
sinceId = None

# If results only below a specific ID are required, set max_id to that ID.
# Else default to no upper limit, start from the most recent tweet matching the search query.
max_id = -1

# However many you want to limit your collection to. How much storage space do you have?
maxTweets = 10000000

tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'w') as f:
    while tweetCount < maxTweets:
        try:
            if (max_id <= 0):
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            since_id=sinceId)
            else:
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1))
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1),
                                            since_id=sinceId)
            if not new_tweets:
                print("No more tweets found")
                break
            for tweet in new_tweets:
                f.write(jsonpickle.encode(tweet._json, unpicklable=False) +
                        '\n')
            tweetCount += len(new_tweets)
            print("Downloaded {0} tweets".format(tweetCount))
            max_id = new_tweets[-1].id
        except tweepy.TweepError as e:
            # Just exit if any error
            print("some error : " + str(e))
            break

print("Downloaded {0} tweets, saved to {1}".format(tweetCount, fName))
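The four api.search branches above differ only in which of max_id and since_id are passed, so the same paging idea can be written more compactly by building the keyword arguments first. The following is a sketch, not the original author's code: search_kwargs and fetch_all are hypothetical names, search stands for any callable with the same keyword interface as api.search, and each page is assumed to arrive newest first with an integer id on every result.

```python
def search_kwargs(query, count, max_id=None, since_id=None):
    """Build keyword arguments for one search call, omitting unset bounds."""
    kwargs = {"q": query, "count": count}
    if max_id is not None:
        kwargs["max_id"] = max_id
    if since_id is not None:
        kwargs["since_id"] = since_id
    return kwargs


def fetch_all(search, query, count=100, since_id=None, max_tweets=1000):
    """Page backwards through results until none are left or max_tweets is hit."""
    collected = []
    max_id = None  # no upper bound on the first request
    while len(collected) < max_tweets:
        page = search(**search_kwargs(query, count, max_id, since_id))
        if not page:
            break
        collected.extend(page)
        # Next page: only tweets strictly older than the oldest one seen so far.
        max_id = page[-1].id - 1
    return collected[:max_tweets]
```

With the setup above, a call would look like fetch_all(api.search, searchQuery + retweet_filter, since_id=sinceId); passing the search function in also makes the paging logic easy to try against a fake before spending rate limit on it.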
Answered by user108569, May 02 '23 16:05