I have the following dataframe:
import pandas as pd

data = {'VehID' : pd.Series([10000,10000,10000,10001,10001,10001,10001]),
        'JobNo' : pd.Series([1,2,2,1,2,3,3]),
        'Material' : pd.Series([5005,5100,5005,5888,5222,5888,5222])}
df = pd.DataFrame(data, columns=['VehID','JobNo','Material'])
It looks like this:
   VehID  JobNo  Material
0  10000      1      5005
1  10000      2      5100
2  10000      2      5005
3  10001      1      5888
4  10001      2      5222
5  10001      3      5888
6  10001      3      5222
I would like to identify the materials that occur in consecutive jobs for every vehicle. For example,
VehID  Material  Jobs
10000      5005  [1,2]
10001      5222  [2,3]
I would like to avoid working with for loops. Does anyone have suggestions for a neat solution to this? Thanks in advance.
You can first gather the job numbers into lists with pandas.DataFrame.groupby, then call apply on the grouped JobNo column with the list constructor as the function:
>>> res = df.groupby(['VehID', 'Material'])['JobNo'].apply(list).reset_index()
>>> res
   VehID  Material   JobNo
0  10000      5005  [1, 2]
1  10000      5100     [2]
2  10001      5222  [2, 3]
3  10001      5888  [1, 3]
And now you can filter out all non-consecutive lists:
>>> f = res.JobNo.apply(lambda x: len(x) > 1 and sorted(x) == list(range(min(x), max(x)+1)))
>>> res[f]
   VehID  Material   JobNo
0  10000      5005  [1, 2]
2  10001      5222  [2, 3]
You can probably speed this up with a smarter check: store the already sorted list in res first, then check min, max and len against a range of the same length.
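One way that speed-up might look is sketched below. It rests on assumptions that are not part of the answer above: duplicate (VehID, JobNo, Material) rows are dropped first, and the helper name is_consecutive_run is made up for illustration. For a sorted list with no duplicates, the values are consecutive exactly when last - first == len - 1, so no range needs to be built:

```python
import pandas as pd

df = pd.DataFrame({
    'VehID':    [10000, 10000, 10000, 10001, 10001, 10001, 10001],
    'JobNo':    [1, 2, 2, 1, 2, 3, 3],
    'Material': [5005, 5100, 5005, 5888, 5222, 5888, 5222],
})

def is_consecutive_run(jobs):
    # Hypothetical helper: assumes `jobs` is sorted and has no duplicates,
    # so the first and last items are the min and max.
    return len(jobs) > 1 and jobs[-1] - jobs[0] == len(jobs) - 1

# Sort once up front; groupby keeps the row order inside each group,
# so every list in `res` is already sorted.
res = (df.drop_duplicates()  # assumption: repeated rows carry no meaning
         .sort_values(['VehID', 'Material', 'JobNo'])
         .groupby(['VehID', 'Material'])['JobNo']
         .apply(list)
         .reset_index())

out = res[res['JobNo'].apply(is_consecutive_run)]
print(out)
```

Like the original filter, this keeps a material only when all of its jobs for a vehicle form one unbroken run, so a material used in jobs 1, 2 and 4 would be dropped.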