I have a DataFrame:
df = pd.DataFrame({'ID':['a','b','d','d','a','b','c','b','d','a','b','a'],
'sec':[3,6,2,0,4,7,10,19,40,3,1,2]})
print(df)
ID sec
0 a 3
1 b 6
2 d 2
3 d 0
4 a 4
5 b 7
6 c 10
7 b 19
8 d 40
9 a 3
10 b 1
11 a 2
I want to count how many times each transition occurs. In the ID column, a->b counts as a transition, and likewise b->d, d->d, d->a, b->c, c->b, and b->a. I can do this using Counter:
from collections import Counter
Counter(zip(df['ID'].to_list(), df['ID'].to_list()[1:]))
Counter({('a', 'b'): 3,
('b', 'd'): 2,
('d', 'd'): 1,
('d', 'a'): 2,
('b', 'c'): 1,
('c', 'b'): 1,
('b', 'a'): 1})
I also need the min and max of the sec column for each transition. For example, a->b occurs 3 times; among those, the min sec value is 1 and the max sec value is 7. I also want to know where each transition first occurred; for a->b that is index 0. For the transition_index column I use the first element of the transition (the index of a), and for min and max I use the second element (the sec value at b).
Here is the final output I want to get:
df = pd.DataFrame({'ID_1':['a','b','d','d','b','c','b'],
'ID_2':['b','d','d','a','c','b','a'],
'sec_min':[1,2,0,3,10,19,2],
'sec_max':[7,40,0,4,10,19,2],
'transition_index':[0,1,2,3,5,6,10],
'count':[3,2,1,2,1,1,1]})
print(df)
ID_1 ID_2 sec_min sec_max transition_index count
0 a b 1 7 0 3
1 b d 2 40 1 2
2 d d 0 0 2 1
3 d a 3 4 3 2
4 b c 10 10 5 1
5 c b 19 19 6 1
6 b a 2 2 10 1
How can I achieve this in Python?
Also I have a huge amount of data, so I'm looking for the fastest way possible.
You have transitions of the form from -> to. 'transition_index' is based on the index of the "from" row, while the 'sec' aggregations are based on the value in the "to" row.
We can shift the index and group on the shifted ID together with the ID, which lets a single groupby with named aggregations produce the desired output.
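# Move the index into a column, then shift it so each row carries the index of its "from" row
df = df.reset_index()
df['index'] = df['index'].shift().astype('Int64')
# Group on (previous ID, current ID); sort=False keeps groups in order of first occurrence
(df.groupby([df['ID'].shift(1).rename('ID_1'), df['ID'].rename('ID_2')], sort=False)
   .agg(sec_min=('sec', 'min'),
        sec_max=('sec', 'max'),
        transition_index=('index', 'first'),
        count=('sec', 'size'))
   .reset_index()
)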
ID_1 ID_2 sec_min sec_max transition_index count
0 a b 1 7 0 3
1 b d 2 40 1 2
2 d d 0 0 2 1
3 d a 3 4 3 2
4 b c 10 10 5 1
5 c b 19 19 6 1
6 b a 2 2 10 1
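For reference, here is a self-contained sketch of the same idea that leaves the original df untouched. It pairs each row with its predecessor using array slices instead of shift, so no NaN row appears. The helper name summarize_transitions is hypothetical, and the sketch assumes ID and sec are ordinary columns with at least two rows.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b', 'd', 'd', 'a', 'b', 'c', 'b', 'd', 'a', 'b', 'a'],
                   'sec': [3, 6, 2, 0, 4, 7, 10, 19, 40, 3, 1, 2]})

def summarize_transitions(frame):
    """Hypothetical helper: one output row per distinct (from, to) pair."""
    ids = frame['ID'].to_numpy()
    pairs = pd.DataFrame({
        'ID_1': ids[:-1],                                   # "from" ID
        'ID_2': ids[1:],                                    # "to" ID
        'sec': frame['sec'].to_numpy()[1:],                 # sec value at the "to" row
        'transition_index': frame.index[:-1].to_numpy(),    # index label of the "from" row
    })
    # sort=False keeps the groups in order of first occurrence
    return (pairs.groupby(['ID_1', 'ID_2'], sort=False)
                 .agg(sec_min=('sec', 'min'),
                      sec_max=('sec', 'max'),
                      transition_index=('transition_index', 'first'),
                      count=('sec', 'size'))
                 .reset_index())

result = summarize_transitions(df)
print(result)
```

Because everything happens in one vectorized groupby, this should scale reasonably to large frames, though I have not benchmarked it.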
Start by adding columns with the previous values of ID and sec:
df['prevID'] = df.ID.shift(fill_value='')
df['prevSec'] = df.sec.shift(fill_value=0)
Then define the following function:
def find(df, IDfrom, IDto):
    rows = df.query('prevID == @IDfrom and ID == @IDto')
    tbl = rows.loc[:, ['prevSec', 'sec']].values
    n = rows.index.size
    return (n, tbl.min(), tbl.max()) if n > 0 else (n, 0, 0)
Now if you run this function e.g. to find transitions from a to b:
find(df, 'a', 'b')
you will get:
(3, 1, 7)
Then call this function for all other from and to values.
Note that this function returns a proper result even if there is no transition between the given values. Of course, you may choose other "surrogate" values for min and max when no transition is found.
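One possible way to call the function for every pair is sketched below: collect the distinct (prevID, ID) pairs that actually occur and build a dictionary of results. The names pairs and summary are only illustrative. Keep in mind that find takes min and max over both prevSec and sec, so a pair such as d->d comes out as (1, 0, 2) rather than the 0/0 in the question's expected table.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b', 'd', 'd', 'a', 'b', 'c', 'b', 'd', 'a', 'b', 'a'],
                   'sec': [3, 6, 2, 0, 4, 7, 10, 19, 40, 3, 1, 2]})
df['prevID'] = df['ID'].shift(fill_value='')
df['prevSec'] = df['sec'].shift(fill_value=0)

def find(df, IDfrom, IDto):
    # Same function as above: (count, min, max) over prevSec and sec
    rows = df.query('prevID == @IDfrom and ID == @IDto')
    tbl = rows.loc[:, ['prevSec', 'sec']].values
    n = rows.index.size
    return (n, tbl.min(), tbl.max()) if n > 0 else (n, 0, 0)

# Distinct (from, to) pairs that occur; the first row has no predecessor, so skip it
pairs = df.iloc[1:][['prevID', 'ID']].drop_duplicates()

# Illustrative: map each pair to its (count, min, max) tuple
summary = {(id_from, id_to): find(df, id_from, id_to)
           for id_from, id_to in pairs.itertuples(index=False, name=None)}

for pair, stats in summary.items():
    print(pair, stats)
```

This runs one query per pair, so with many distinct IDs on a large frame a single groupby is likely to be faster.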