I have the following dataframe:
import pandas as pd

data = {'VehID' : pd.Series([10000,10000,10000,10001,10001,10001,10001]),
        'JobNo' : pd.Series([1,2,2,1,2,3,3]),
        'Material' : pd.Series([5005,5100,5005,5888,5222,5888,5222])}
df   = pd.DataFrame(data, columns=['VehID','JobNo','Material'])
It looks like this:
   VehID    JobNo  Material
0  10000      1      5005
1  10000      2      5100
2  10000      2      5005
3  10001      1      5888
4  10001      2      5222
5  10001      3      5888
6  10001      3      5222
I would like to identify the materials that occur in consecutive jobs for every vehicle. For example,
VehID  Material  Jobs
10000    5005    [1,2]
10001    5222    [2,3]
I would like to avoid for loops. Does anyone have a suggestion for a neat solution? Thanks in advance.
You can first group the rows with pandas.DataFrame.groupby and then collect each group's job numbers into a list by passing the list constructor to apply:
>>> res = df.groupby(['VehID', 'Material'])['JobNo'].apply(list).reset_index()
>>> res
   VehID  Material   JobNo
0  10000      5005  [1, 2]
1  10000      5100     [2]
2  10001      5222  [2, 3]
3  10001      5888  [1, 3]
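Now you can filter out all non-consecutive lists (on Python 3, range has to be wrapped in list, because a list never compares equal to a range object):
>>> f = res.JobNo.apply(lambda x: len(x) > 1 and sorted(x) == list(range(min(x), max(x)+1)))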
>>> res[f]
   VehID  Material   JobNo
0  10000      5005  [1, 2]
2  10001      5222  [2, 3]
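You can probably speed this up with a smarter function: first store the already sorted list in res, then check that max - min + 1 equals len instead of building a range of the same length (this assumes the job numbers within a group are distinct).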
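The speed-up suggested above could look roughly like the sketch below. It is only one possible reading of that idea, not a benchmarked solution: the helper name is_consecutive_run is made up for illustration, and the call to drop_duplicates is an added assumption so that the job numbers in each group are distinct. Sorting by JobNo before grouping relies on groupby keeping the row order within each group.

```python
import pandas as pd

df = pd.DataFrame({
    'VehID': [10000, 10000, 10000, 10001, 10001, 10001, 10001],
    'JobNo': [1, 2, 2, 1, 2, 3, 3],
    'Material': [5005, 5100, 5005, 5888, 5222, 5888, 5222],
})

def is_consecutive_run(jobs):
    # Hypothetical helper: expects a sorted list of distinct job numbers.
    # Such a list is one unbroken run exactly when its span equals its length.
    return len(jobs) > 1 and jobs[-1] - jobs[0] + 1 == len(jobs)

res = (
    df.drop_duplicates()              # assumption: repeated rows carry no extra information
      .sort_values('JobNo')           # sort once, so every per-group list is already sorted
      .groupby(['VehID', 'Material'])['JobNo']
      .apply(list)
      .reset_index()
)

consecutive = res[res['JobNo'].apply(is_consecutive_run)]
print(consecutive)
```

On the sample data this keeps the same two rows as the filter above. Like that filter, it only accepts a material whose jobs form a single unbroken run, so a material used in jobs 1, 2 and 5 would be rejected.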