Let's say I have the following list in Python. It is ordered first by Equip, then by Date:
my_list = [
{'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-01'},
{'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-02'},
{'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-03'},
{'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-04'},
{'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-05'},
{'Equip': 'A-2', 'Job': 'Job 1', 'Date': '2018-01-03'},
{'Equip': 'A-2', 'Job': 'Job 3', 'Date': '2018-01-04'},
{'Equip': 'A-2', 'Job': 'Job 3', 'Date': '2018-01-05'}
]
What I want to do is collapse the list by each set where a given piece of Equipment's job does not change, and grab the first and last date the equipment was there. E.g., this simple example should change to:
list_by_job = [
{'Equip': 'A-1', 'Job': 'Job 1', 'First': '2018-01-01', 'Last': '2018-01-03'},
{'Equip': 'A-1', 'Job': 'Job 2', 'First': '2018-01-04', 'Last': '2018-01-05'},
{'Equip': 'A-2', 'Job': 'Job 1', 'First': '2018-01-03', 'Last': '2018-01-03'},
{'Equip': 'A-2', 'Job': 'Job 3', 'First': '2018-01-04', 'Last': '2018-01-05'}
]
A couple of things to note:
A-2 on Job 1 is only there for a single day, thus its First and Last Date should be the same.
For point 3, the list
my_list = [
{'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-01'},
{'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-02'},
{'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-03'}
]
should yield
list_by_job = [
{'Equip': 'A-1', 'Job': 'Job 1', 'First': '2018-01-01', 'Last': '2018-01-01'},
{'Equip': 'A-1', 'Job': 'Job 2', 'First': '2018-01-02', 'Last': '2018-01-02'},
{'Equip': 'A-1', 'Job': 'Job 1', 'First': '2018-01-03', 'Last': '2018-01-03'}
]
Currently I am doing so in a simple loop/non-pythonic way:
list_by_job = []
last_entry = None
for entry in my_list:
    if last_entry is None or last_entry['Equip'] != entry['Equip'] or last_entry['Job'] != entry['Job']:
        list_by_job.append({'Equip': entry['Equip'], 'Job': entry['Job'], 'First': entry['Date'], 'Last': entry['Date']})
    else:
        list_by_job[-1]['Last'] = entry['Date']
    last_entry = entry
Is there a more pythonic way to do this using Python's list comprehension, etc?
You can use itertools.groupby:
import itertools
def _key(d):
    return (d['Equip'], d['Job'])
my_list = [{'Date': '2018-01-01', 'Equip': 'A-1', 'Job': 'Job 1'}, {'Date': '2018-01-02', 'Equip': 'A-1', 'Job': 'Job 1'}, {'Date': '2018-01-03', 'Equip': 'A-1', 'Job': 'Job 1'}, {'Date': '2018-01-04', 'Equip': 'A-1', 'Job': 'Job 2'}, {'Date': '2018-01-05', 'Equip': 'A-1', 'Job': 'Job 2'}, {'Date': '2018-01-03', 'Equip': 'A-2', 'Job': 'Job 1'}, {'Date': '2018-01-04', 'Equip': 'A-2', 'Job': 'Job 3'}, {'Date': '2018-01-05', 'Equip': 'A-2', 'Job': 'Job 3'}]
new_data = [[a, list(b)] for a, b in itertools.groupby(my_list, key=_key)]
final_result = [{"Equip":c, 'Job':d, 'First':b[0]['Date'], 'Last':b[-1]['Date']} for [c, d], b in new_data]
Output:
[{'Equip': 'A-1', 'Job': 'Job 1', 'Last': '2018-01-03', 'First': '2018-01-01'},
{'Equip': 'A-1', 'Job': 'Job 2', 'Last': '2018-01-05', 'First': '2018-01-04'},
{'Equip': 'A-2', 'Job': 'Job 1', 'Last': '2018-01-03', 'First': '2018-01-03'},
{'Equip': 'A-2', 'Job': 'Job 3', 'Last': '2018-01-05', 'First': '2018-01-04'}]
Edit:
Using data as suggested in your comment:
my_list = [{'Date': '2018-01-01', 'Equip': 'A-1', 'Job': 'Job 1'}, {'Date': '2018-01-02', 'Equip': 'A-1', 'Job': 'Job 2'}, {'Date': '2018-01-03', 'Equip': 'A-1', 'Job': 'Job 1'}, {'Date': '2018-01-04', 'Equip': 'A-1', 'Job': 'Job 2'}, {'Date': '2018-01-05', 'Equip': 'A-1', 'Job': 'Job 2'}, {'Date': '2018-01-03', 'Equip': 'A-2', 'Job': 'Job 1'}, {'Date': '2018-01-04', 'Equip': 'A-2', 'Job': 'Job 3'}, {'Date': '2018-01-05', 'Equip': 'A-2', 'Job': 'Job 3'}]
Output:
[{'Equip': 'A-1', 'Job': 'Job 1', 'Last': '2018-01-01', 'First': '2018-01-01'},
{'Equip': 'A-1', 'Job': 'Job 2', 'Last': '2018-01-02', 'First': '2018-01-02'},
{'Equip': 'A-1', 'Job': 'Job 1', 'Last': '2018-01-03', 'First': '2018-01-03'},
{'Equip': 'A-1', 'Job': 'Job 2', 'Last': '2018-01-05', 'First': '2018-01-04'},
{'Equip': 'A-2', 'Job': 'Job 1', 'Last': '2018-01-03', 'First': '2018-01-03'},
{'Equip': 'A-2', 'Job': 'Job 3', 'Last': '2018-01-05', 'First': '2018-01-04'}]
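The two comprehensions above can also be folded into one helper. What follows is a minimal sketch rather than the answerer's code, and the name `collapse_runs` is hypothetical. It relies on `itertools.groupby` merging only adjacent items with equal keys, which is why a job that recurs later yields a separate entry. It assumes the input is already ordered by Equip, then by Date.

```python
from itertools import groupby
from operator import itemgetter


def collapse_runs(rows):
    """Collapse consecutive rows sharing (Equip, Job) into First/Last ranges.

    Assumes rows are already ordered by Equip, then by Date.
    """
    result = []
    # groupby only merges *adjacent* items whose keys compare equal
    for (equip, job), run in groupby(rows, key=itemgetter('Equip', 'Job')):
        dates = [row['Date'] for row in run]
        result.append({'Equip': equip, 'Job': job,
                       'First': dates[0], 'Last': dates[-1]})
    return result


sample = [
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-01'},
    {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-02'},
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-03'},
]
collapsed = collapse_runs(sample)
```

Here `collapsed` holds three entries, one per run, because the two Job 1 rows are not adjacent.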
I suggest using pandas for this. itertools.groupby is cool but IMO a bit harder to comprehend.
>>> import pandas as pd
>>>
>>> my_list = [
...: {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-01'},
...: {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-02'},
...: {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-03'},
...: {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-04'},
...: {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-05'},
...: {'Equip': 'A-2', 'Job': 'Job 1', 'Date': '2018-01-03'},
...: {'Equip': 'A-2', 'Job': 'Job 3', 'Date': '2018-01-04'},
...: {'Equip': 'A-2', 'Job': 'Job 3', 'Date': '2018-01-05'}
...:]
>>>
>>> df = pd.DataFrame(my_list)
>>> df['Date'] = pd.to_datetime(df['Date'])
>>> groups = df.groupby(['Equip', 'Job']).agg({'Date': ['min', 'max']}).reset_index()
>>> groups.columns = ['Equip', 'Job', 'First', 'Last']
>>> groups
>>>
Equip Job First Last
0 A-1 Job 1 2018-01-01 2018-01-03
1 A-1 Job 2 2018-01-04 2018-01-05
2 A-2 Job 1 2018-01-03 2018-01-03
3 A-2 Job 3 2018-01-04 2018-01-05
>>>
>>> groups.to_dict(orient='records')
>>>
[{'Equip': 'A-1',
'First': Timestamp('2018-01-01 00:00:00'),
'Job': 'Job 1',
'Last': Timestamp('2018-01-03 00:00:00')},
{'Equip': 'A-1',
'First': Timestamp('2018-01-04 00:00:00'),
'Job': 'Job 2',
'Last': Timestamp('2018-01-05 00:00:00')},
{'Equip': 'A-2',
'First': Timestamp('2018-01-03 00:00:00'),
'Job': 'Job 1',
'Last': Timestamp('2018-01-03 00:00:00')},
{'Equip': 'A-2',
'First': Timestamp('2018-01-04 00:00:00'),
'Job': 'Job 3',
'Last': Timestamp('2018-01-05 00:00:00')}]
I suggest keeping the dates as time stamps.
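One caveat: grouping on ['Equip', 'Job'] alone merges every row with the same pair, so a job that a piece of equipment leaves and later returns to would be collapsed into a single range. The sketch below is one possible way to keep such runs separate in pandas. It is not part of the answer above, and the names `changed`, `run_id` and `runs` are hypothetical. It assumes the rows are already ordered by Equip, then by Date, and leaves the dates as strings for brevity.

```python
import pandas as pd

rows = [
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-01'},
    {'Equip': 'A-1', 'Job': 'Job 2', 'Date': '2018-01-02'},
    {'Equip': 'A-1', 'Job': 'Job 1', 'Date': '2018-01-03'},
]
df = pd.DataFrame(rows)

# Flag rows where Equip or Job differs from the previous row.
changed = (df['Equip'] != df['Equip'].shift()) | (df['Job'] != df['Job'].shift())
# The running count of flags gives each consecutive run its own integer label.
run_id = changed.cumsum().rename('run')

runs = (
    df.groupby(run_id)
      .agg(Equip=('Equip', 'first'), Job=('Job', 'first'),
           First=('Date', 'min'), Last=('Date', 'max'))
      .reset_index(drop=True)
)
records = runs.to_dict(orient='records')
```

With this input, `records` contains three entries instead of the two that a plain groupby on ['Equip', 'Job'] would produce.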
You can use pandas here, a data-analysis library that offers database-style grouping and aggregation:
import pandas as pd
df = pd.DataFrame(my_list)
df2 = df.groupby(['Equip', 'Job']).agg(['min', 'max']).rename(columns={'min': 'First', 'max': 'Last'})
df2.columns = df2.columns.droplevel()
df2 = df2.reset_index()
result = df2.to_dict('records')
For the given sample input, this gives:
>>> df2.to_dict('records')
[{'Equip': 'A-1', 'Job': 'Job 1', 'First': '2018-01-01', 'Last': '2018-01-03'},
{'Equip': 'A-1', 'Job': 'Job 2', 'First': '2018-01-04', 'Last': '2018-01-05'},
{'Equip': 'A-2', 'Job': 'Job 1', 'First': '2018-01-03', 'Last': '2018-01-03'},
{'Equip': 'A-2', 'Job': 'Job 3', 'First': '2018-01-04', 'Last': '2018-01-05'}]
In case the date format is not '%Y-%m-%d' (a format in which plain string comparison already matches chronological order), one first needs to convert it with pd.to_datetime(..), like:
import pandas as pd
df = pd.DataFrame(my_list)
df['Date'] = pd.to_datetime(df['Date'])
df2 = df.groupby(['Equip', 'Job']).agg(['min', 'max']).rename(columns={'min': 'First', 'max': 'Last'})
df2.columns = df2.columns.droplevel()
df2 = df2.reset_index()
result = df2.to_dict('records')
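To see why the conversion matters, here is a small sketch using only the standard library. The month/day/year strings are made-up sample values, not data from the question. Compared as text, such dates do not sort chronologically, so min and max would pick the wrong rows.

```python
from datetime import datetime

# Hypothetical month/day/year strings, used only for illustration.
us_dates = ['12/31/2017', '01/02/2018']

# Compared as text, the 2018 date wins because '0' sorts before '1'.
earliest_as_text = min(us_dates)

# Parsed into datetime objects, the comparison is chronological.
earliest_parsed = min(us_dates, key=lambda s: datetime.strptime(s, '%m/%d/%Y'))
```

Here `earliest_as_text` is '01/02/2018' while `earliest_parsed` is '12/31/2017', which is the chronologically correct answer.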