Calculate min and max value of a transition with index of first occurrence in pandas

Tags: python, pandas, numpy

I have a DataFrame:

import pandas as pd

df = pd.DataFrame({'ID':['a','b','d','d','a','b','c','b','d','a','b','a'], 
                   'sec':[3,6,2,0,4,7,10,19,40,3,1,2]})
print(df)
   ID  sec
0   a    3
1   b    6
2   d    2
3   d    0
4   a    4
5   b    7
6   c   10
7   b   19
8   d   40
9   a    3
10  b    1
11  a    2

I want to count how many times each transition occurs. In the ID column, a->b counts as a transition, as do b->d, d->d, d->a, b->c, c->b, and b->a. I can do this with Counter:

from collections import Counter

Counter(zip(df['ID'].to_list(),df['ID'].to_list()[1:]))
Counter({('a', 'b'): 3,
         ('b', 'd'): 2,
         ('d', 'd'): 1,
         ('d', 'a'): 2,
         ('b', 'c'): 1,
         ('c', 'b'): 1,
         ('b', 'a'): 1})

I also need the min and max of the sec column for each transition. For example, a->b occurs 3 times; among those, the min sec value is 1 and the max sec value is 7. I also want the index where each transition first occurred; for a->b it is 0. For the transition_index column I use the first element of the transition (the index of a), and for min and max I use the second element (the sec value at b).

Here is the final output I want to get:

df = pd.DataFrame({'ID_1':['a','b','d','d','b','c','b'], 
                   'ID_2':['b','d','d','a','c','b','a'],
                   'sec_min':[1,2,0,3,10,19,2],
                   'sec_max':[7,40,0,4,10,19,2],
                   'transition_index':[0,1,2,3,5,6,10],
                   'count':[3,2,1,2,1,1,1]})
print(df)
  ID_1 ID_2  sec_min  sec_max  transition_index  count
0    a    b        1        7                 0      3
1    b    d        2       40                 1      2
2    d    d        0        0                 2      1
3    d    a        3        4                 3      2
4    b    c       10       10                 5      1
5    c    b       19       19                 6      1
6    b    a        2        2                10      1

How can I achieve this in Python?

Also I have a huge amount of data, so I'm looking for the fastest way possible.


asked Jul 26 '20 17:07

Space Impact

2 Answers

You have transitions of the form from -> to. 'transition_index' is based on the index of the "from" row, while the 'sec' aggregations are based on the value associated with the "to" row.

We can shift the index and group on the shifted ID together with the ID itself, which lets a single groupby with named aggregations produce the desired output.

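# Move the index into a column, then shift it so each row holds the index of its "from" row
df = df.reset_index()
df['index'] = df['index'].shift().astype('Int64')

# Group on (previous ID, current ID); the first row has no previous ID and is dropped
(df.groupby([df['ID'].shift(1).rename('ID_1'), df['ID'].rename('ID_2')], sort=False)
   .agg(sec_min=('sec', 'min'),                # min/max use the "to" row's sec
        sec_max=('sec', 'max'),
        transition_index=('index', 'first'),   # index of the first "from" row
        count=('sec', 'size'))
   .reset_index()
)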

  ID_1 ID_2  sec_min  sec_max  transition_index  count
0    a    b        1        7                 0      3
1    b    d        2       40                 1      2
2    d    d        0        0                 2      1
3    d    a        3        4                 3      2
4    b    c       10       10                 5      1
5    c    b       19       19                 6      1
6    b    a        2        2                10      1
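To make the shifting step easier to inspect, here is a self-contained sketch of the same idea that first builds an explicit table with one row per transition and then groups it. The names `pairs`, `from_index` and `result` are illustrative choices, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b', 'd', 'd', 'a', 'b', 'c', 'b', 'd', 'a', 'b', 'a'],
                   'sec': [3, 6, 2, 0, 4, 7, 10, 19, 40, 3, 1, 2]})

# One row per transition: pair each row ("to") with the row before it ("from").
pairs = (pd.DataFrame({
             'ID_1': df['ID'].shift(),                    # "from" label
             'ID_2': df['ID'],                            # "to" label
             'sec': df['sec'],                            # value at the "to" row
             'from_index': df.index.to_series().shift(),  # index of the "from" row
         })
         .iloc[1:]                                        # first row has no predecessor
         .astype({'from_index': 'int64'}))

# sort=False keeps the groups in order of first appearance.
result = (pairs.groupby(['ID_1', 'ID_2'], sort=False)
               .agg(sec_min=('sec', 'min'),
                    sec_max=('sec', 'max'),
                    transition_index=('from_index', 'first'),
                    count=('sec', 'size'))
               .reset_index())
print(result)
```

Printing `pairs` before grouping shows exactly which `sec` value and which index each transition contributes, which can help when checking the result on real data.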


answered Oct 22 '22 02:10

ALollz

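Start by adding a column with the previous value of ID:

df['prevID'] = df['ID'].shift(fill_value='')

Then define the following function. It takes min and max over the sec value of the "to" row only, as the question specifies; mixing in the previous row's sec would give wrong results for transitions such as d->a:

def find(df, IDfrom, IDto):
    # Rows where the previous ID is IDfrom and the current ID is IDto
    rows = df.query('prevID == @IDfrom and ID == @IDto')
    n = len(rows)
    if n == 0:
        return (0, 0, 0)  # surrogate min/max when the transition never occurs
    return (n, int(rows['sec'].min()), int(rows['sec'].max()))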

Now, if you run this function, e.g. to find transitions from a to b:

find(df, 'a', 'b')

you will get:

(3, 1, 7)

Then call this function for all other from and to values.

Note that this function returns a well-defined result (count 0) even if there is no transition between the given values. Of course, you may choose other "surrogate" values for min and max if no transition has been found.
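One way to call the function for all other from and to values is to loop over the distinct pairs that actually occur in the data. The sketch below restates a `find` helper so that it runs on its own; this variant aggregates only the `sec` value of the "to" row, and the names `observed` and `summary` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b', 'd', 'd', 'a', 'b', 'c', 'b', 'd', 'a', 'b', 'a'],
                   'sec': [3, 6, 2, 0, 4, 7, 10, 19, 40, 3, 1, 2]})
df['prevID'] = df['ID'].shift(fill_value='')

def find(df, IDfrom, IDto):
    # Rows where the previous ID is IDfrom and the current ID is IDto
    rows = df.query('prevID == @IDfrom and ID == @IDto')
    n = len(rows)
    if n == 0:
        return (0, 0, 0)  # surrogate values when the transition never occurs
    return (n, int(rows['sec'].min()), int(rows['sec'].max()))

# Distinct (from, to) pairs in order of first appearance; the first row
# has no predecessor, so it is skipped.
observed = df.iloc[1:][['prevID', 'ID']].drop_duplicates()

# Map each observed pair to its (count, min, max) tuple.
summary = {(f, t): find(df, f, t)
           for f, t in observed.itertuples(index=False, name=None)}

for pair, stats in summary.items():
    print(pair, stats)
```

Each call scans the whole DataFrame, so the total cost grows with the number of distinct transitions times the number of rows. On a large dataset a single groupby is likely to be considerably faster.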

answered Oct 22 '22 02:10

Valdi_Bo
