I have some Twitter data, and I would like to plot activity over time by type of tweet (tweet/mention/retweet).
The data is currently loaded into a list of tuples, each containing a date and a type:
time = [('2014-04-13', 'tweet'),
('2014-04-13', 'tweet'),
('2014-04-13', 'mention'),
('2014-04-13', 'retweet'),
('2014-04-13', 'mention'),
('2014-04-13', 'tweet'),
('2014-04-13', 'retweet'),
('2014-04-13', 'mention'),
('2014-04-13', 'tweet'),
('2014-04-13', 'retweet'),
('2014-04-13', 'retweet'),
('2014-04-13', 'mention'),
('2014-04-13', 'tweet'),
('2014-04-13', 'tweet'),
('2014-04-13', 'tweet'),
('2014-04-13', 'tweet'),
('2014-04-13', 'mention'),
('2014-04-13', 'retweet'),
('2014-04-13', 'mention'),
('2014-04-13', 'tweet')]
I've loaded the data into a pandas DataFrame:
time_df = pd.DataFrame(time, columns=['date','time'])
Now that data looks like this:
date time
0 2014-04-13 tweet
1 2014-04-13 tweet
2 2014-04-13 mention
3 2014-04-13 retweet
4 2014-04-13 mention
...
However, now I'm lost when it comes to plotting this data over time. Also, I would like to break out each type (tweet/mention/retweet) as a different color line. I should also note that sometimes I might need to aggregate the data by day/week/month.
Ideally I would like my plot to look similar to the following plot, except with Tweet, Mention, Retweet:
So, I think I understand what you need to do, even if this isn't explicit in your question.
Allow me to mock up some data:
import numpy as np
import pandas
import random
tweet_types = ['tweet', 'retweet', 'mention']
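# pandas.date_range builds the regular 5-minute index; recent pandas versions
# no longer accept freq/start/end in the DatetimeIndex constructor.
index = pandas.date_range(start='2014-04-13', end='2014-05-13', freq='5min')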
tweets = [random.choice(tweet_types) for _ in range(len(index))]
time_df = pandas.DataFrame(index=index, data=tweets, columns=['tweet type'])
time_df['day'] = time_df.index.date
time_df['count'] = 1
print(time_df.head())
So the first few rows now look like this:
tweet type day count
2014-04-13 00:00:00 mention 2014-04-13 1
2014-04-13 00:05:00 mention 2014-04-13 1
2014-04-13 00:10:00 tweet 2014-04-13 1
2014-04-13 00:15:00 tweet 2014-04-13 1
2014-04-13 00:20:00 retweet 2014-04-13 1
I added the count column because we need something to total up for our daily aggregation, done here:
daily_counts = time_df.groupby(by=['tweet type', 'day']).count()
daily_counts_xtab = daily_counts.unstack(level='tweet type')['count']
print(daily_counts_xtab.head())
Which gives us...
tweet type mention retweet tweet
day
2014-04-13 89 101 98
2014-04-14 98 113 77
2014-04-15 87 103 98
2014-04-16 81 107 100
2014-04-17 96 92 100
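For data shaped like the list of tuples in the question (a date string plus a type), roughly the same table can be built without the helper count column. The following is only a sketch: the five sample rows are made up for illustration, and the column names 'date' and 'type' are my own choice rather than the ones used in the question.

```python
import pandas as pd

# Made-up sample in the question's format: (date string, tweet type).
time = [('2014-04-13', 'tweet'),
        ('2014-04-13', 'mention'),
        ('2014-04-14', 'retweet'),
        ('2014-04-14', 'tweet'),
        ('2014-04-14', 'tweet')]

# Column names here are an assumption made for readability.
df = pd.DataFrame(time, columns=['date', 'type'])

# Parse the strings so the rows become real dates rather than text.
df['date'] = pd.to_datetime(df['date'])

# crosstab counts how often each (date, type) pair occurs:
# one row per date, one column per tweet type.
counts = pd.crosstab(df['date'], df['type'])
print(counts)
```

Here counts plays the same role as daily_counts_xtab, so it can be plotted in the same way.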
So then
daily_counts_xtab.plot()
Gives me a line plot of the daily counts, with one line per tweet type.
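The question also mentions aggregating by week or month. One possible way to do that, assuming the data keeps a DatetimeIndex as in my mock-up, is to turn the type column into 0/1 indicator columns and let resample sum them per period. The variable names below are illustrative, and the random seed is only there to make the mock data reproducible.

```python
import random

import pandas as pd

random.seed(0)  # fixed seed so the mock data is reproducible

# Mock data in the same shape as above: one row per tweet, indexed by timestamp.
tweet_types = ['tweet', 'retweet', 'mention']
index = pd.date_range(start='2014-04-13', end='2014-05-13', freq='5min')
time_df = pd.DataFrame(
    {'tweet type': [random.choice(tweet_types) for _ in index]},
    index=index,
)

# One column per tweet type, holding 1 where the row has that type and 0 otherwise.
indicators = pd.get_dummies(time_df['tweet type']).astype(int)

# Summing the indicators within each period gives the count per type.
daily_counts = indicators.resample('D').sum()
weekly_counts = indicators.resample('W').sum()
monthly_counts = indicators.resample('MS').sum()  # 'MS' labels each month by its first day

print(weekly_counts)
```

Calling weekly_counts.plot() or monthly_counts.plot() should then draw one line per tweet type, just like the daily plot.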