
Finding contiguous, non-unique slices in Pandas series without iterating

I'm trying to parse a logfile of our manufacturing process. Most of the time the process is run automatically but occasionally, the engineer needs to switch into manual mode to make some changes and then switches back to automatic control by the reactor software. When set to manual mode the logfile records the step as being "MAN.OP." instead of a number. Below is a representative example.

import pandas as pd

steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
ser_orig = pd.Series(steps)

which results in

0           1
1           2
2           2
3     MAN.OP.
4     MAN.OP.
5           2
6           2
7           3
8           3
9     MAN.OP.
10    MAN.OP.
11          4
12          4
dtype: object

I need to detect each contiguous run of 'MAN.OP.' and give every run its own distinct name. The rows on either side of a manual section keep their step number, so in this example the two regions with value 2 still belong to the same step after the manual mode section is renamed, like this:

0                 1
1                 2
2                 2
3     Manual_Mode_0
4     Manual_Mode_0
5                 2
6                 2
7                 3
8                 3
9     Manual_Mode_1
10    Manual_Mode_1
11                4
12                4
dtype: object

I have code that iterates over this series and produces the correct result when the series is passed to my object. The setter is:

@step_series.setter
def step_series(self, ss):
    """
    On assignment, give the manual mode steps a unique name. Leave 
    the steps done on recipe the same.
    """
    manual_mode = "MAN.OP."
    new_manual_mode_text = "Manual_Mode_{}"
    counter = 0
    continuous = False
    for i in ss.index:
        if continuous and ss.at[i] != manual_mode:
            continuous = False
            counter += 1

        elif not continuous and ss.at[i] == manual_mode:
            continuous = True
            ss.at[i] = new_manual_mode_text.format(str(counter))

        elif continuous and ss.at[i] == manual_mode:
            ss.at[i] = new_manual_mode_text.format(str(counter))

    self._step_series = ss

but this iterates over the entire series and is the slowest part of my code other than reading the logfile over the network.

How can I detect these non-unique sections and rename them uniquely without iterating over the entire series? The series is a column selection from a larger dataframe so adding extra columns is fine if needed.

For the completed answer I ended up with:

@step_series.setter
def step_series(self, ss):
    pd.options.mode.chained_assignment = None
    manual_mode = "MAN.OP."
    new_manual_mode_text = "Manual_Mode_{}"

    newManOp = (ss=='MAN.OP.') & (ss != ss.shift())
    ss[ss == 'MAN.OP.'] = 'Manual_Mode_' + (newManOp.cumsum()-1).astype(str)

    self._step_series = ss
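
The same logic can also be written so that it neither modifies the series passed in nor switches off the chained-assignment warning globally. This is only a sketch: the function name label_manual_runs and its prefix parameter are made up for illustration, and it relies on Series.mask returning a new series.

```python
import pandas as pd

def label_manual_runs(ss, manual_mode="MAN.OP.", prefix="Manual_Mode_"):
    """Return a copy of ss with each contiguous manual-mode run uniquely named."""
    is_manual = ss == manual_mode
    # True only on the first row of each contiguous manual-mode run
    run_start = is_manual & (ss != ss.shift())
    # Zero-based run number, turned into the replacement label
    labels = prefix + (run_start.cumsum() - 1).astype(str)
    # Series.mask swaps in the label where is_manual is True and returns a new Series
    return ss.mask(is_manual, labels)

original = pd.Series([1, 2, 2, 'MAN.OP.', 'MAN.OP.', 2, 2, 3, 3, 'MAN.OP.', 'MAN.OP.', 4, 4])
labelled = label_manual_runs(original)
print(labelled)
```

Inside the setter this would reduce to self._step_series = label_manual_runs(ss), leaving the caller's dataframe column untouched.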

zeppelin_d asked Sep 25 '22 08:09


2 Answers

Here's one way:

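import pandas as pd

steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
steps = pd.Series(steps)

# True only on the first row of each contiguous 'MAN.OP.' run
newManOp = (steps=='MAN.OP.') & (steps != steps.shift())
# The running count of run starts numbers each run; append it to the label
steps[steps=='MAN.OP.'] += newManOp.cumsum().astype(str)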

>>> steps
0            1
1            2
2            2
3     MAN.OP.1
4     MAN.OP.1
5            2
6            2
7            3
8            3
9     MAN.OP.2
10    MAN.OP.2
11           4
12           4
dtype: object
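
To see why this works, it can help to look at the intermediate values side by side. The following is a small sketch for inspection only; the names is_manual, run_start, run_id and debug are illustrative and not part of the code above.

```python
import pandas as pd

steps = pd.Series([1, 2, 2, 'MAN.OP.', 'MAN.OP.', 2, 2, 3, 3, 'MAN.OP.', 'MAN.OP.', 4, 4])

# True on every manual-mode row
is_manual = steps == 'MAN.OP.'
# True only where a manual-mode row differs from the row before it,
# i.e. on the first row of each contiguous run
run_start = is_manual & (steps != steps.shift())
# Running count of run starts: every row of the n-th run carries the value n
run_id = run_start.cumsum()

debug = pd.DataFrame({'step': steps, 'is_manual': is_manual,
                      'run_start': run_start, 'run_id': run_id})
print(debug)
```

Only the rows where is_manual is True are ever written back, so the run_id values on the numbered steps are simply ignored.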

To get the exact format you listed (starting from zero instead of one, and changing from "MAN.OP." to "Manual_Mode_"), just tweak the last line:

steps[steps=='MAN.OP.'] = 'Manual_Mode_' + (newManOp.cumsum()-1).astype(str)

>>> steps
0                 1
1                 2
2                 2
3     Manual_Mode_0
4     Manual_Mode_0
5                 2
6                 2
7                 3
8                 3
9     Manual_Mode_1
10    Manual_Mode_1
11                4
12                4
dtype: object

There is a pandas enhancement request for contiguous groupby, which would make this type of task simpler.
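
In the meantime, a common workaround is to build the contiguous-block key by hand with the same shift/cumsum trick and pass it to an ordinary groupby. A minimal sketch, with block_id, run_lengths and run_values as illustrative names:

```python
import pandas as pd

steps = pd.Series([1, 2, 2, 'MAN.OP.', 'MAN.OP.', 2, 2, 3, 3, 'MAN.OP.', 'MAN.OP.', 4, 4])

# A new block starts wherever a value differs from the one before it,
# so the cumulative sum gives each contiguous run its own id.
block_id = (steps != steps.shift()).cumsum()

# Group on the block id to get one entry per contiguous run.
run_lengths = steps.groupby(block_id).size()
run_values = steps.groupby(block_id).first()

print(pd.DataFrame({'value': run_values, 'length': run_lengths}))
```

Here the two runs of 2 end up in different blocks because a manual section sits between them, which is what distinguishes this from a plain groupby on the values.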

BrenBarn answered Sep 28 '22 04:09


There is a function in matplotlib that takes a boolean array and returns a list of (start, end) pairs. Each pair represents a contiguous region where the input is True, with end being exclusive.

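# Newer matplotlib releases provide this as matplotlib.cbook.contiguous_regions
import matplotlib.mlab as mlab

# Same names as in the question's setter; ser_orig is the series from the question
manual_mode = "MAN.OP."
new_manual_mode_text = "Manual_Mode_{}"

regions = mlab.contiguous_regions(ser_orig == manual_mode)
for i, (start, end) in enumerate(regions):
    ser_orig[start:end] = new_manual_mode_text.format(i)
ser_orig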

0                 1
1                 2
2                 2
3     Manual_Mode_0
4     Manual_Mode_0
5                 2
6                 2
7                 3
8                 3
9     Manual_Mode_1
10    Manual_Mode_1
11                4
12                4
dtype: object
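
If depending on matplotlib for this is not desirable, the same (start, end) pairs can be computed with NumPy alone. This is a sketch of one way to do it, not matplotlib's implementation; the helper name contiguous_true_regions is made up here.

```python
import numpy as np
import pandas as pd

def contiguous_true_regions(mask):
    """Return (start, end) pairs, end exclusive, for each run of True in mask."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with False on both sides so runs touching either edge are detected.
    padded = np.concatenate(([False], mask, [False]))
    # +1 marks a False->True edge (run start), -1 a True->False edge (run end).
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), ends.tolist()))

ser = pd.Series([1, 2, 2, 'MAN.OP.', 'MAN.OP.', 2, 2, 3, 3, 'MAN.OP.', 'MAN.OP.', 4, 4])
regions = contiguous_true_regions(ser == 'MAN.OP.')
for i, (start, end) in enumerate(regions):
    # iloc makes the positional slicing explicit, whatever the index is
    ser.iloc[start:end] = 'Manual_Mode_{}'.format(i)
print(ser)
```

The loop now runs once per manual-mode region rather than once per row, so it stays cheap as long as manual interventions are rare.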

Stop harming Monica answered Sep 28 '22 06:09