Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed up JSON to dataframe w/ a lot of data manipulation

I have a massive blob of JSON data formatted as follows:

[
    [{
        "created_at": "2017-04-28T16:52:36Z",
        "as_of": "2017-04-28T17:00:05Z",
        "trends": [{
            "url": "http://twitter.com/search?q=%23ChavezSigueCandanga",
            "query": "%23ChavezSigueCandanga",
            "tweet_volume": 44587,
            "name": "#ChavezSigueCandanga",
            "promoted_content": null
        }, {
            "url": "http://twitter.com/search?q=%2327Abr",
            "query": "%2327Abr",
            "tweet_volume": 79781,
            "name": "#27Abr",
            "promoted_content": null
        }],
        "locations": [{
            "woeid": 395277,
            "name": "Turmero"
        }]
    }],
    [{
        "created_at": "2017-04-28T16:57:35Z",
        "as_of": "2017-04-28T17:00:03Z",
        "trends": [{
            "url": "http://twitter.com/search?q=%23fyrefestival",
            "query": "%23fyrefestival",
            "tweet_volume": 141385,
            "name": "#fyrefestival",
            "promoted_content": null
        }, {
            "url": "http://twitter.com/search?q=%23HotDocs17",
            "query": "%23HotDocs17",
            "tweet_volume": null,
            "name": "#HotDocs17",
            "promoted_content": null
        }],
        "locations": [{
            "woeid": 9807,
            "name": "Vancouver"
        }]
    }]
]...

I wrote a function that formats it into a pandas dataframe that takes this form:

+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
|    |              name              | promoted_content |              query               | tweet_volume |                             url                              |        as_of         |      created_at      | location_name | location_woeid |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
| 47 | #BatesMotel                    |                  | %23BatesMotel                    | 59748        | http://twitter.com/search?q=%23BatesMotel                    | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
| 48 | #AdviceForPeopleJoiningTwitter |                  | %23AdviceForPeopleJoiningTwitter | 51222        | http://twitter.com/search?q=%23AdviceForPeopleJoiningTwitter | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
| 49 | #CADTHSymp                     |                  | %23CADTHSymp                     |              | http://twitter.com/search?q=%23CADTHSymp                     | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg      | 2972           |
| 0  | #WorldPenguinDay               |                  | %23WorldPenguinDay               | 79006        | http://twitter.com/search?q=%23WorldPenguinDay               | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
| 1  | #TravelTuesday                 |                  | %23TravelTuesday                 |              | http://twitter.com/search?q=%23TravelTuesday                 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
| 2  | #DigitalLeap                   |                  | %23DigitalLeap                   |              | http://twitter.com/search?q=%23DigitalLeap                   | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto       | 4118           |
| …  | …                              | …                | …                                | …            | …                                                            | …                    | …                    | …             | …              |
| 0  | #nusnc17                       |                  | %23nusnc17                       |              | http://twitter.com/search?q=%23nusnc17                       | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
| 1  | #WorldPenguinDay               |                  | %23WorldPenguinDay               | 79006        | http://twitter.com/search?q=%23WorldPenguinDay               | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
| 2  | #littleboyblue                 |                  | %23littleboyblue                 | 20772        | http://twitter.com/search?q=%23littleboyblue                 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham    | 12723          |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+

This is the function that writes the JSON to a DataFrame:

def trends_to_dataframe(data):
    df = pd.DataFrame()

    for location in data:
        temp_df = pd.DataFrame()

        for trend in location[0]['trends']:
            temp_df = temp_df.append(pd.Series(trend), ignore_index=True)

        temp_df['as_of'] = location[0]['as_of']
        temp_df['created_at'] = location[0]['created_at']
        temp_df['location_name'] = location[0]['locations'][0]['name']
        temp_df['location_woeid'] = location[0]['locations'][0]['woeid']

        df = df.append(temp_df)

    return df

Unfortunately, with the amount of data I have (and some simple timers I tested with) this will take about 4 hours to finish. Any thoughts on how to speed this up?

like image 593
metersk Avatar asked Oct 17 '22 13:10

metersk


1 Answers

You could speed things up by async flattening the data with concurrent.futures, then loading it all into a DataFrame with from_records.

from concurrent.futures import ThreadPoolExecutor

def get_trends(location):
    trends = []
    for trend in location[0]['trends']:
        trend['as_of'] = location[0]['as_of']
        trend['created_at'] = location[0]['created_at']
        trend['location_name'] = location[0]['locations'][0]['name']
        trend['location_woeid'] = location[0]['locations'][0]['woeid']
        trends.append(trend)
    return trends

flat_data = []
with ThreadPoolExecutor() as executor:
    for location in data:
        flat_data += get_trends(location)

df = pd.DataFrame.from_records(flat_data)
like image 90
brennan Avatar answered Oct 20 '22 11:10

brennan