I have a massive blob of JSON data formatted as follows:
[
[{
"created_at": "2017-04-28T16:52:36Z",
"as_of": "2017-04-28T17:00:05Z",
"trends": [{
"url": "http://twitter.com/search?q=%23ChavezSigueCandanga",
"query": "%23ChavezSigueCandanga",
"tweet_volume": 44587,
"name": "#ChavezSigueCandanga",
"promoted_content": null
}, {
"url": "http://twitter.com/search?q=%2327Abr",
"query": "%2327Abr",
"tweet_volume": 79781,
"name": "#27Abr",
"promoted_content": null
}],
"locations": [{
"woeid": 395277,
"name": "Turmero"
}]
}],
[{
"created_at": "2017-04-28T16:57:35Z",
"as_of": "2017-04-28T17:00:03Z",
"trends": [{
"url": "http://twitter.com/search?q=%23fyrefestival",
"query": "%23fyrefestival",
"tweet_volume": 141385,
"name": "#fyrefestival",
"promoted_content": null
}, {
"url": "http://twitter.com/search?q=%23HotDocs17",
"query": "%23HotDocs17",
"tweet_volume": null,
"name": "#HotDocs17",
"promoted_content": null
}],
"locations": [{
"woeid": 9807,
"name": "Vancouver"
}]
}]
]...
I wrote a function that formats it into a pandas DataFrame of this form:
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
| | name | promoted_content | query | tweet_volume | url | as_of | created_at | location_name | location_woeid |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
| 47 | #BatesMotel | | %23BatesMotel | 59748 | http://twitter.com/search?q=%23BatesMotel | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg | 2972 |
| 48 | #AdviceForPeopleJoiningTwitter | | %23AdviceForPeopleJoiningTwitter | 51222 | http://twitter.com/search?q=%23AdviceForPeopleJoiningTwitter | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg | 2972 |
| 49 | #CADTHSymp | | %23CADTHSymp | | http://twitter.com/search?q=%23CADTHSymp | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg | 2972 |
| 0 | #WorldPenguinDay | | %23WorldPenguinDay | 79006 | http://twitter.com/search?q=%23WorldPenguinDay | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto | 4118 |
| 1 | #TravelTuesday | | %23TravelTuesday | | http://twitter.com/search?q=%23TravelTuesday | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto | 4118 |
| 2 | #DigitalLeap | | %23DigitalLeap | | http://twitter.com/search?q=%23DigitalLeap | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto | 4118 |
| … | … | … | … | … | … | … | … | … | … |
| 0 | #nusnc17 | | %23nusnc17 | | http://twitter.com/search?q=%23nusnc17 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham | 12723 |
| 1 | #WorldPenguinDay | | %23WorldPenguinDay | 79006 | http://twitter.com/search?q=%23WorldPenguinDay | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham | 12723 |
| 2 | #littleboyblue | | %23littleboyblue | 20772 | http://twitter.com/search?q=%23littleboyblue | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham | 12723 |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
This is the function that writes the JSON to a DataFrame:
import pandas as pd

def trends_to_dataframe(data):
    df = pd.DataFrame()
    for location in data:
        # One row per trend for this location
        temp_df = pd.DataFrame()
        for trend in location[0]['trends']:
            temp_df = temp_df.append(pd.Series(trend), ignore_index=True)
        # Broadcast the location-level metadata across those rows
        temp_df['as_of'] = location[0]['as_of']
        temp_df['created_at'] = location[0]['created_at']
        temp_df['location_name'] = location[0]['locations'][0]['name']
        temp_df['location_woeid'] = location[0]['locations'][0]['woeid']
        df = df.append(temp_df)
    return df
Unfortunately, with the amount of data I have (based on some simple timers I ran), this will take about 4 hours to finish. Any thoughts on how to speed this up?
You can speed this up by flattening the data into a plain list of records first and then building the DataFrame once with from_records. The repeated DataFrame.append calls in your version are the bottleneck: each call copies the entire frame, so the total cost grows quadratically with the number of rows. The flattening itself can be farmed out to a pool with concurrent.futures, though for pure dict manipulation the GIL limits what threads buy you; most of the win comes from the single-pass construction.
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def get_trends(location):
    # Flatten one location entry into a list of per-trend records,
    # copying the location-level metadata onto each trend dict
    trends = []
    for trend in location[0]['trends']:
        trend['as_of'] = location[0]['as_of']
        trend['created_at'] = location[0]['created_at']
        trend['location_name'] = location[0]['locations'][0]['name']
        trend['location_woeid'] = location[0]['locations'][0]['woeid']
        trends.append(trend)
    return trends

flat_data = []
with ThreadPoolExecutor() as executor:
    # executor.map dispatches the flattening to the pool and yields
    # one list of records per location, in order
    for trends in executor.map(get_trends, data):
        flat_data += trends

# Build the DataFrame once from the flat list of records
df = pd.DataFrame.from_records(flat_data)
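If you'd rather not hand-roll the flattening, pandas.json_normalize can do the per-trend explosion and metadata broadcast in one call (on older pandas, import it as from pandas.io.json import json_normalize). The sketch below is written against the sample structure shown in the question, where each outer element is a one-element list wrapping a single response dict; the hoisting of the locations fields is needed because meta paths can't index into a list.

import pandas as pd

# A minimal sketch, assuming `data` has the shape shown in the question.
# json_normalize can't descend into the `locations` list via `meta`,
# so hoist the two location fields to the top level first.
records = []
for location in data:
    entry = location[0]
    entry['location_name'] = entry['locations'][0]['name']
    entry['location_woeid'] = entry['locations'][0]['woeid']
    records.append(entry)

# One row per trend, with the timestamp/location metadata broadcast
# across the rows of its entry
df = pd.json_normalize(
    records,
    record_path='trends',
    meta=['as_of', 'created_at', 'location_name', 'location_woeid'],
)

Either way, the single-pass DataFrame construction is what removes the quadratic append cost; for data of this shape the thread pool is a secondary optimization.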