Replace pandas groupby and apply to increase performance

I am using pandas groupby and apply to go from a DataFrame containing 150 million rows with the following columns:

Id  Created     Item    Stock   Price
1   2019-01-01  Item 1  200     10
1   2019-01-01  Item 2  100     15
2   2019-01-01  Item 1  200     10

To a list of 2.2 million records that look like this:

[{
  "Id": 1,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10},
    {"Item":"Item 2", "Stock": 100, "Price": 5}
    ]
},
{
  "Id": 2,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10}
    ]
}]

Mainly using this line of code:

df.groupby(['Id', 'Created']).apply(lambda x: x[['Item', 'Stock', 'Price']].to_dict(orient='records'))

This takes quite some time, and as I understand it, operations like this are heavy for pandas to perform. Is there a non-pandas way to accomplish the same thing with better performance?
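For reference, here is a self-contained version of what I am doing now, using a toy frame with the sample rows from above; the last step that reshapes the grouped Series into the list is only a rough sketch of my surrounding code:

import pandas as pd

# Toy version of the data shown above; the real frame has 150 million rows.
df = pd.DataFrame({
    'Id': [1, 1, 2],
    'Created': ['2019-01-01', '2019-01-01', '2019-01-01'],
    'Item': ['Item 1', 'Item 2', 'Item 1'],
    'Stock': [200, 100, 200],
    'Price': [10, 15, 10],
})

# The slow step: one Python-level apply call per (Id, Created) group.
grouped = df.groupby(['Id', 'Created']).apply(
    lambda x: x[['Item', 'Stock', 'Price']].to_dict(orient='records'))

# Reshape the resulting Series (indexed by Id/Created) into the list of dicts.
records = [{'Id': id_, 'Created': created, 'Items': items}
           for (id_, created), items in grouped.items()]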

Edit: The operation takes 55 minutes. I am using ScriptProcessor in AWS, which lets me specify the amount of compute I want.

Edit 2: With artona's solution I am getting close. This is what I manage to produce now:

defaultdict(<function __main__.<lambda>()>,
            {'1': defaultdict(list,
                              {'Id': '1',
                               'Created': '2019-01-01',
                               'Items': [{'Item': 'Item 2', 'Stock': 100, 'Price': 15},
                                         {'Item': 'Item 1', 'Stock': 200, 'Price': 10}]}),
             '2': defaultdict(list,
                              {'Id': '2',
                               'Created': '2019-01-01',
                               'Items': [{'Item': 'Item 1', 'Stock': 200, 'Price': 10}]})})

But how do I go from the above to this?

[{
  "Id": 1,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10},
    {"Item":"Item 2", "Stock": 100, "Price": 5}
    ]
},
{
  "Id": 2,
  "Created": "2019-01-01",
  "Items": [
    {"Item":"Item 1", "Stock": 200, "Price": 10}
    ]
}]

Basically I'm only interested in the part after "defaultdict(list," for every record. I need it as a list that does not use the Id as a key.
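In other words, I want to throw away the outer Id keys and keep only the inner dicts. A rough sketch of the shape I am after (the nested defaultdict is typed out by hand here, purely for illustration):

from collections import defaultdict

# Hand-built copy of the nested structure shown above in Edit 2.
nested = defaultdict(lambda: defaultdict(list))
nested['1'].update({'Id': '1', 'Created': '2019-01-01',
                    'Items': [{'Item': 'Item 2', 'Stock': 100, 'Price': 15},
                              {'Item': 'Item 1', 'Stock': 200, 'Price': 10}]})
nested['2'].update({'Id': '2', 'Created': '2019-01-01',
                    'Items': [{'Item': 'Item 1', 'Stock': 200, 'Price': 10}]})

# Drop the outer Id keys and keep only the inner dicts as a plain list.
wanted = [dict(inner) for inner in nested.values()]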

Edit 3: A last update with the results on my production dataset. With the accepted answer provided by artona I went from 55 minutes down to 7(!) minutes, without any major changes to my code. The solution provided by Phung Duy Phong took me from 55 minutes to 17, which is not too bad either.

1 Answer

Use collections.defaultdict and itertuples. This iterates over the rows only once.

In [105]: %timeit df.groupby(['Id', 'Created']).apply(lambda x: x[['Item', 'Stock', 'Price']].to_dict(orient='records'))
10.1 s ± 44.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [107]: from collections import defaultdict
     ...: def create_dict():
     ...:     # Nested mapping: Id -> Created -> list of item dicts.
     ...:     dict_ids = defaultdict(lambda: defaultdict(list))
     ...:     # Single pass over the rows; itertuples avoids per-group apply overhead.
     ...:     for row in df.itertuples():
     ...:         dict_ids[row.Id][row.Created].append({"Item": row.Item, "Stock": row.Stock, "Price": row.Price})
     ...:     # Flatten the nested dict into the desired list of records.
     ...:     list_of_dicts = [{"Id": key_id, "Created": key_created, "Items": values}
     ...:                      for key_id, value_id in dict_ids.items()
     ...:                      for key_created, values in value_id.items()]
     ...:     return list_of_dicts

In [108]: %timeit create_dict()
4.58 s ± 417 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
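If the final target is JSON, the list returned by create_dict() can be dumped directly with the standard library; a small illustration, not part of the timing above:

import json

# Serialize the list of dicts produced above; default=str is only a
# safety net in case Created is a Timestamp rather than a string.
list_of_dicts = create_dict()
as_json = json.dumps(list_of_dicts, default=str)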