I am trying to find a way to do a cumulative total that accounts for ties in Pandas.
Lets take hypothetical data from a track meet, where I have people, races, heats, and time.
Each person's placement is according to the following:
For a given race/heat combination:
and so on...
This would be fairly simple code, but for one thing..
If two people have the same time, they both get the same place and then the next time greater than their time will have that value + 1 as the placement.
In the table below, for 100 yard dash, heat 1, RUNNER1 finished first, RUNNER2/RUNNER3 finished second, and RUNNER3 finished third (next time after RUNNER2/RUNNER3)
So basically, the logic is as follows:
If race <> race.shift() or heat <> heat.shift() then place =1
If race = race.shift() and heat = heat.shift() and time>time.shift then place =place.shift()+1
If race = race.shift() and heat = heat.shift() and time>time.shift then place =place.shift()
The part that confuses me is how to handle the ties. Otherwise I could do something like
df['Place']=np.where(
(df['race']==df['race'].shift())
&
(df['heat']==df['heat'].shift()),
df['Place'].shift()+1,
1)
Thank you!
Sample data follows:
Person,Race,Heat,Time
RUNNER1,100 Yard Dash,1,9.87
RUNNER2,100 Yard Dash,1,9.92
RUNNER3,100 Yard Dash,1,9.92
RUNNER4,100 Yard Dash,1,9.96
RUNNER5,100 Yard Dash,1,9.97
RUNNER6,100 Yard Dash,1,10.01
RUNNER7,100 Yard Dash,2,9.88
RUNNER8,100 Yard Dash,2,9.93
RUNNER9,100 Yard Dash,2,9.93
RUNNER10,100 Yard Dash,2,10.03
RUNNER11,100 Yard Dash,2,10.26
RUNNER7,200 Yard Dash,1,19.63
RUNNER8,200 Yard Dash,1,19.67
RUNNER9,200 Yard Dash,1,19.72
RUNNER10,200 Yard Dash,1,19.72
RUNNER11,200 Yard Dash,1,19.86
RUNNER12,200 Yard Dash,1,19.92
what I want at the end is
Person,Race,Heat,Time,Place
RUNNER1,100 Yard Dash,1,9.87,1
RUNNER2,100 Yard Dash,1,9.92,2
RUNNER3,100 Yard Dash,1,9.92,2
RUNNER4,100 Yard Dash,1,9.96,3
RUNNER5,100 Yard Dash,1,9.97,4
RUNNER6,100 Yard Dash,1,10.01,5
RUNNER7,100 Yard Dash,2,9.88,1
RUNNER8,100 Yard Dash,2,9.93,2
RUNNER9,100 Yard Dash,2,9.93,2
RUNNER10,100 Yard Dash,2,10.03,3
RUNNER11,100 Yard Dash,2,10.26,4
RUNNER7,200 Yard Dash,1,19.63,1
RUNNER8,200 Yard Dash,1,19.67,2
RUNNER9,200 Yard Dash,1,19.72,3
RUNNER10,200 Yard Dash,1,19.72,3
RUNNER11,200 Yard Dash,1,19.86,4
RUNNER12,200 Yard Dash,1,19.92,4
[edit] Now, one step further..
Lets assume that once I leave a set of unique values, the next time that set comes up, the values will reset to 1..
So, for example, - Note that it goes to "heat 1" and then "heat 2" and back to "heat 1" - I don't want the rankings to continue from the prior "heat 1", rather I want them to reset.
Person,Race,Heat,Time,Place
RUNNER1,100 Yard Dash,1,9.87,1
RUNNER2,100 Yard Dash,1,9.92,2
RUNNER3,100 Yard Dash,1,9.92,2
RUNNER4,100 Yard Dash,2,9.96,1
RUNNER5,100 Yard Dash,2,9.97,2
RUNNER6,100 Yard Dash,2,10.01,3
RUNNER7,100 Yard Dash,1,9.88,1
RUNNER8,100 Yard Dash,1,9.93,2
RUNNER9,100 Yard Dash,1,9.93,2
You could use:
grouped = df.groupby(['Race','Heat'])
df['Place'] = grouped['Time'].transform(lambda x: pd.factorize(x, sort=True)[0]+1)
import pandas as pd
df = pd.DataFrame({'Heat': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1], 'Person': ['RUNNER1', 'RUNNER2', 'RUNNER3', 'RUNNER4', 'RUNNER5', 'RUNNER6', 'RUNNER7', 'RUNNER8', 'RUNNER9', 'RUNNER10', 'RUNNER11', 'RUNNER7', 'RUNNER8', 'RUNNER9', 'RUNNER10', 'RUNNER11', 'RUNNER12'], 'Race': ['100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '200 Yard Dash', '200 Yard Dash', '200 Yard Dash', '200 Yard Dash', '200 Yard Dash', '200 Yard Dash'], 'Time': [9.8699999999999992, 9.9199999999999999, 9.9199999999999999, 9.9600000000000009, 9.9700000000000006, 10.01, 9.8800000000000008, 9.9299999999999997, 9.9299999999999997, 10.029999999999999, 10.26, 19.629999999999999, 19.670000000000002, 19.719999999999999, 19.719999999999999, 19.859999999999999, 19.920000000000002]})
grouped = df.groupby(['Race','Heat'])
df['Place'] = grouped['Time'].transform(lambda x: pd.factorize(x, sort=True)[0]+1)
df['Rank'] = grouped['Time'].rank(method='min')
print(df)
yields
Heat Person Race Time Place Rank
0 1 RUNNER1 100 Yard Dash 9.87 1.0 1.0
1 1 RUNNER2 100 Yard Dash 9.92 2.0 2.0
2 1 RUNNER3 100 Yard Dash 9.92 2.0 2.0
3 1 RUNNER4 100 Yard Dash 9.96 3.0 4.0
4 1 RUNNER5 100 Yard Dash 9.97 4.0 5.0
5 1 RUNNER6 100 Yard Dash 10.01 5.0 6.0
6 2 RUNNER7 100 Yard Dash 9.88 1.0 1.0
7 2 RUNNER8 100 Yard Dash 9.93 2.0 2.0
8 2 RUNNER9 100 Yard Dash 9.93 2.0 2.0
9 2 RUNNER10 100 Yard Dash 10.03 3.0 4.0
10 2 RUNNER11 100 Yard Dash 10.26 4.0 5.0
11 1 RUNNER7 200 Yard Dash 19.63 1.0 1.0
12 1 RUNNER8 200 Yard Dash 19.67 2.0 2.0
13 1 RUNNER9 200 Yard Dash 19.72 3.0 3.0
14 1 RUNNER10 200 Yard Dash 19.72 3.0 3.0
15 1 RUNNER11 200 Yard Dash 19.86 4.0 5.0
16 1 RUNNER12 200 Yard Dash 19.92 5.0 6.0
Note that Pandas has a Groupby.rank method which can compute many common forms of rank -- but not the one you described. Notice how for example on row 3 the Rank is 4 after a tie between the second and third runners, while the Place is 3.
Regarding the edit: Use
(df['Heat'] != df['Heat'].shift()).cumsum()
to disambiguate the heats:
import pandas as pd
df = pd.DataFrame({'Heat': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1], 'Person': ['RUNNER1', 'RUNNER2', 'RUNNER3', 'RUNNER4', 'RUNNER5', 'RUNNER6', 'RUNNER7', 'RUNNER8', 'RUNNER9', 'RUNNER10', 'RUNNER11', 'RUNNER7', 'RUNNER8', 'RUNNER9', 'RUNNER10', 'RUNNER11', 'RUNNER12'], 'Race': ['100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash', '100 Yard Dash'], 'Time': [9.8699999999999992, 9.9199999999999999, 9.9199999999999999, 9.9600000000000009, 9.9700000000000006, 10.01, 9.8800000000000008, 9.9299999999999997, 9.9299999999999997, 10.029999999999999, 10.26, 19.629999999999999, 19.670000000000002, 19.719999999999999, 19.719999999999999, 19.859999999999999, 19.920000000000002]})
df['HeatGroup'] = (df['Heat'] != df['Heat'].shift()).cumsum()
grouped = df.groupby(['Race','HeatGroup'])
df['Place'] = grouped['Time'].transform(lambda x: pd.factorize(x, sort=True)[0]+1)
df['Rank'] = grouped['Time'].rank(method='min')
print(df)
yields
Heat Person Race Time HeatGroup Place Rank
0 1 RUNNER1 100 Yard Dash 9.87 1 1.0 1.0
1 1 RUNNER2 100 Yard Dash 9.92 1 2.0 2.0
2 1 RUNNER3 100 Yard Dash 9.92 1 2.0 2.0
3 1 RUNNER4 100 Yard Dash 9.96 1 3.0 4.0
4 1 RUNNER5 100 Yard Dash 9.97 1 4.0 5.0
5 1 RUNNER6 100 Yard Dash 10.01 1 5.0 6.0
6 2 RUNNER7 100 Yard Dash 9.88 2 1.0 1.0
7 2 RUNNER8 100 Yard Dash 9.93 2 2.0 2.0
8 2 RUNNER9 100 Yard Dash 9.93 2 2.0 2.0
9 2 RUNNER10 100 Yard Dash 10.03 2 3.0 4.0
10 2 RUNNER11 100 Yard Dash 10.26 2 4.0 5.0
11 1 RUNNER7 100 Yard Dash 19.63 3 1.0 1.0
12 1 RUNNER8 100 Yard Dash 19.67 3 2.0 2.0
13 1 RUNNER9 100 Yard Dash 19.72 3 3.0 3.0
14 1 RUNNER10 100 Yard Dash 19.72 3 3.0 3.0
15 1 RUNNER11 100 Yard Dash 19.86 3 4.0 5.0
16 1 RUNNER12 100 Yard Dash 19.92 3 5.0 6.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With