
Pythonic way of collapsing/grouping a list to aggregating max/min

Let's say I have the following list in Python. It is ordered first by Equip, then by Date:

my_list = [
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-01'},
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-02'},
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-03'},
    {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-04'},
    {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-05'},
    {'Equip': 'A-2', 'Job': 'Job 1', 'Date': '2018-01-03'},
    {'Equip': 'A-2', 'Job': 'Job 3', 'Date': '2018-01-04'},
    {'Equip': 'A-2', 'Job': 'Job 3', 'Date': '2018-01-05'}
]

What I want to do is collapse the list into one entry per run of consecutive rows in which a given piece of equipment's job does not change, keeping the first and last date the equipment was on that job. E.g., this simple example should become:

list_by_job = [
    {'Equip': 'A-1', 'Job': 'Job 1', 'First': '2018-01-01', 'Last': '2018-01-03'},
    {'Equip': 'A-1', 'Job': 'Job 2', 'First': '2018-01-04', 'Last': '2018-01-05'},
    {'Equip': 'A-2', 'Job': 'Job 1', 'First': '2018-01-03', 'Last': '2018-01-03'},
    {'Equip': 'A-2', 'Job': 'Job 3', 'First': '2018-01-04', 'Last': '2018-01-05'}
]

A couple of things to note:

  1. A-2 on Job 1 is only there for a single day, thus its First and Last Date should be the same.
  2. A piece of equipment could be on a job, leave that job, and come back. In this case, I'd need to see an entry for each time it was on the job, not just one single summary.
  3. As stated before, the list is already sorted first by Equip, then by Date, so that ordering can be assumed. (If there is a better way to sort to accomplish this, I am all ears)
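
On the sorting question in point 3: if the input were not already ordered, sorting on (Equip, Date) would be one way to establish the ordering assumed above. This is only a sketch; `unsorted_list` and `sorted_list` are hypothetical names and the records are made up for illustration. Job is deliberately left out of the sort key, since sorting by Job would place separate stints on the same job next to each other and hide the case described in point 2.

```python
from operator import itemgetter

# Hypothetical unsorted input, same record shape as in the question.
unsorted_list = [
    {'Equip': 'A-2', 'Job': 'Job 3', 'Date': '2018-01-04'},
    {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-04'},
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-01'},
    {'Equip': 'A-2', 'Job': 'Job 1', 'Date': '2018-01-03'},
]

# Sort by Equip first, then by Date. ISO 'YYYY-MM-DD' strings sort
# chronologically, so no date parsing is needed for this format.
sorted_list = sorted(unsorted_list, key=itemgetter('Equip', 'Date'))
```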

For point 2, the list

my_list = [
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-01'},
    {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-02'},
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-03'}
]

should yield

list_by_job = [
    {'Equip': 'A-1', 'Job': 'Job 1', 'First': '2018-01-01', 'Last': '2018-01-01'},
    {'Equip': 'A-1', 'Job': 'Job 2', 'First': '2018-01-02', 'Last': '2018-01-02'},
    {'Equip': 'A-1', 'Job': 'Job 1', 'First': '2018-01-03', 'Last': '2018-01-03'}
]

Currently I am doing so in a simple loop/non-pythonic way:

list_by_job = []

last_entry = None
for entry in my_list:
    if last_entry is None or last_entry['Equip'] != entry['Equip'] or last_entry['Job'] != entry['Job']:
        list_by_job.append({'Equip': entry['Equip'], 'Job': entry['Job'], 'First': entry['Date'], 'Last': entry['Date']})
    else:
        list_by_job[-1]['Last'] = entry['Date']
    last_entry = entry

Is there a more pythonic way to do this using Python's list comprehension, etc?

MarkD asked Nov 04 '18 19:11



3 Answers

You can use itertools.groupby:

import itertools

def _key(d):
    # Consecutive records with the same (Equip, Job) pair form one group.
    return (d['Equip'], d['Job'])

my_list = [{'Date': '2018-01-01', 'Equip': 'A-1', 'Job': 'Job 1'}, {'Date': '2018-01-02', 'Equip': 'A-1', 'Job': 'Job 1'}, {'Date': '2018-01-03', 'Equip': 'A-1', 'Job': 'Job 1'}, {'Date': '2018-01-04', 'Equip': 'A-1', 'Job': 'Job 2'}, {'Date': '2018-01-05', 'Equip': 'A-1', 'Job': 'Job 2'}, {'Date': '2018-01-03', 'Equip': 'A-2', 'Job': 'Job 1'}, {'Date': '2018-01-04', 'Equip': 'A-2', 'Job': 'Job 3'}, {'Date': '2018-01-05', 'Equip': 'A-2', 'Job': 'Job 3'}]
new_data = [[a, list(b)] for a, b in itertools.groupby(my_list, key=_key)]
final_result = [{"Equip":c, 'Job':d, 'First':b[0]['Date'], 'Last':b[-1]['Date']} for [c, d], b in new_data]

Output:

[{'Equip': 'A-1', 'Job': 'Job 1', 'Last': '2018-01-03', 'First': '2018-01-01'}, 
 {'Equip': 'A-1', 'Job': 'Job 2', 'Last': '2018-01-05', 'First': '2018-01-04'}, 
 {'Equip': 'A-2', 'Job': 'Job 1', 'Last': '2018-01-03', 'First': '2018-01-03'}, 
 {'Equip': 'A-2', 'Job': 'Job 3', 'Last': '2018-01-05', 'First': '2018-01-04'}]

Edit:

Using data as suggested in your comment:

my_list = [{'Date': '2018-01-01', 'Equip': 'A-1', 'Job': 'Job 1'}, {'Date': '2018-01-02', 'Equip': 'A-1', 'Job': 'Job 2'}, {'Date': '2018-01-03', 'Equip': 'A-1', 'Job': 'Job 1'}, {'Date': '2018-01-04', 'Equip': 'A-1', 'Job': 'Job 2'}, {'Date': '2018-01-05', 'Equip': 'A-1', 'Job': 'Job 2'}, {'Date': '2018-01-03', 'Equip': 'A-2', 'Job': 'Job 1'}, {'Date': '2018-01-04', 'Equip': 'A-2', 'Job': 'Job 3'}, {'Date': '2018-01-05', 'Equip': 'A-2', 'Job': 'Job 3'}]

Output:

[{'Equip': 'A-1', 'Job': 'Job 1', 'Last': '2018-01-01', 'First': '2018-01-01'}, 
 {'Equip': 'A-1', 'Job': 'Job 2', 'Last': '2018-01-02', 'First': '2018-01-02'}, 
 {'Equip': 'A-1', 'Job': 'Job 1', 'Last': '2018-01-03', 'First': '2018-01-03'}, 
 {'Equip': 'A-1', 'Job': 'Job 2', 'Last': '2018-01-05', 'First': '2018-01-04'}, 
 {'Equip': 'A-2', 'Job': 'Job 1', 'Last': '2018-01-03', 'First': '2018-01-03'}, 
 {'Equip': 'A-2', 'Job': 'Job 3', 'Last': '2018-01-05', 'First': '2018-01-04'}]
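
The approach above works for the "leave and come back" case because itertools.groupby starts a new group every time the key changes, rather than pooling all records that share a key. A minimal sketch of the same idea wrapped in a function follows; `collapse_runs`, `sample` and `spans` are hypothetical names introduced here, not part of the answer.

```python
from itertools import groupby
from operator import itemgetter


def collapse_runs(records):
    """Collapse consecutive records sharing (Equip, Job) into First/Last spans."""
    result = []
    # groupby starts a new group whenever the key changes, so a job that is
    # left and later resumed yields two separate groups.
    for (equip, job), run in groupby(records, key=itemgetter('Equip', 'Job')):
        dates = [rec['Date'] for rec in run]
        result.append({'Equip': equip, 'Job': job,
                       'First': dates[0], 'Last': dates[-1]})
    return result


sample = [
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-01'},
    {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-02'},
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-03'},
]
spans = collapse_runs(sample)
```

Here `spans` holds three entries, one per stint, which matches the expected result given in the question for this input. The function relies on the input already being ordered by Equip and then Date.
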
Ajax1234 answered Nov 05 '22 19:11


I suggest using pandas for this.

itertools.groupby is cool but IMO a bit harder to comprehend.

>>> import pandas as pd
>>>
>>> my_list = [
...     {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-01'},
...     {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-02'},
...     {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-03'},
...     {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-04'},
...     {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-05'},
...     {'Equip': 'A-2', 'Job': 'Job 1', 'Date': '2018-01-03'},
...     {'Equip': 'A-2', 'Job': 'Job 3', 'Date': '2018-01-04'},
...     {'Equip': 'A-2', 'Job': 'Job 3', 'Date': '2018-01-05'}
... ]
>>>
>>> df = pd.DataFrame(my_list)
>>> df['Date'] = pd.to_datetime(df['Date'])
>>> groups = df.groupby(['Equip', 'Job']).agg({'Date': ['min', 'max']}).reset_index()
>>> groups.columns = ['Equip', 'Job', 'First', 'Last']
>>> groups
  Equip    Job      First       Last
0   A-1  Job 1 2018-01-01 2018-01-03
1   A-1  Job 2 2018-01-04 2018-01-05
2   A-2  Job 1 2018-01-03 2018-01-03
3   A-2  Job 3 2018-01-04 2018-01-05
>>>
>>> groups.to_dict(orient='records')
[{'Equip': 'A-1',
  'First': Timestamp('2018-01-01 00:00:00'),
  'Job': 'Job 1',
  'Last': Timestamp('2018-01-03 00:00:00')},
 {'Equip': 'A-1',
  'First': Timestamp('2018-01-04 00:00:00'),
  'Job': 'Job 2',
  'Last': Timestamp('2018-01-05 00:00:00')},
 {'Equip': 'A-2',
  'First': Timestamp('2018-01-03 00:00:00'),
  'Job': 'Job 1',
  'Last': Timestamp('2018-01-03 00:00:00')},
 {'Equip': 'A-2',
  'First': Timestamp('2018-01-04 00:00:00'),
  'Job': 'Job 3',
  'Last': Timestamp('2018-01-05 00:00:00')}]

I suggest keeping the dates as time stamps.
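
One caveat: grouping by ['Equip', 'Job'] alone pools every row for a pair, so a job that is left and later resumed comes out as a single span, which differs from point 2 in the question. A possible way to keep separate stints apart in pandas is to label each consecutive run first and group on that label. The following is a sketch under that assumption; the `run` column and the names `sample`, `changed` and `runs` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical input in which A-1 leaves Job 1 and later returns to it.
sample = [
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-01'},
    {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-02'},
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-03'},
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-04'},
]
df = pd.DataFrame(sample)
df['Date'] = pd.to_datetime(df['Date'])

# True wherever Equip or Job differs from the previous row; the cumulative
# sum then gives every consecutive run its own integer id.
changed = (df['Equip'] != df['Equip'].shift()) | (df['Job'] != df['Job'].shift())
df['run'] = changed.cumsum()

# Aggregate per run instead of per (Equip, Job) pair.
runs = (
    df.groupby('run')
      .agg(Equip=('Equip', 'first'), Job=('Job', 'first'),
           First=('Date', 'min'), Last=('Date', 'max'))
      .reset_index(drop=True)
)
```

With this input `runs` has three rows, one for each stint. Like the loop in the question, it assumes the rows are already ordered by Equip and then Date.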

timgeb answered Nov 05 '22 19:11


You can use pandas here, a data-analysis library that offers database-style grouping and aggregation:

import pandas as pd

df = pd.DataFrame(my_list)
df2 = df.groupby(['Equip', 'Job']).agg(['min', 'max']).rename(columns={'min': 'First', 'max': 'Last'})
df2.columns = df2.columns.droplevel()
df2 = df2.reset_index()
result = df2.to_dict('records')

For the given sample input, this gives:

>>> df2.to_dict('records')
[{'Equip': 'A-1', 'Job': 'Job 1', 'First': '2018-01-01', 'Last': '2018-01-03'},
 {'Equip': 'A-1', 'Job': 'Job 2', 'First': '2018-01-04', 'Last': '2018-01-05'},
 {'Equip': 'A-2', 'Job': 'Job 1', 'First': '2018-01-03', 'Last': '2018-01-03'},
 {'Equip': 'A-2', 'Job': 'Job 3', 'First': '2018-01-04', 'Last': '2018-01-05'}]

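The min and max above compare the dates as strings, which only orders them chronologically for ISO-style '%Y-%m-%d' values. If the dates are in any other format, first convert the column with pd.to_datetime(..), like: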

import pandas as pd

df = pd.DataFrame(my_list)
df['Date'] = pd.to_datetime(df['Date'])
df2 = df.groupby(['Equip', 'Job']).agg(['min', 'max']).rename(columns={'min': 'First', 'max': 'Last'})
df2.columns = df2.columns.droplevel()
df2 = df2.reset_index()
result = df2.to_dict('records')
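
To illustrate why the conversion matters, here is a small standard-library sketch using made-up day/month/year strings (not data from the question). The names `dates`, `earliest_as_text` and `earliest_parsed` are hypothetical.

```python
from datetime import datetime

# Hypothetical dates in day/month/year format, not from the question.
dates = ['05/01/2018', '04/02/2018']  # 5 Jan 2018 and 4 Feb 2018

# Lexicographic comparison looks at the day digits first, so it picks
# 4 Feb as the "earliest" date.
earliest_as_text = min(dates)

# Parsing first compares real dates and picks 5 Jan.
earliest_parsed = min(dates, key=lambda s: datetime.strptime(s, '%d/%m/%Y'))
```

The same mismatch would affect a string-based min/max aggregation in pandas, which is why the datetime conversion comes before the groupby.
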
Willem Van Onsem answered Nov 05 '22 19:11