Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculating difference between first and individual occurence (Python Pandas)

Let's take a look at a simple HR dataframe as an example:

Name Year Department Salary New column: hit100after
John 2018 R&D 80 0
Marie 2018 Legal 90 0
Jill 2018 Legal 100 0
John 2019 R&D 85 0
Marie 2019 Legal 95 0
Jill 2019 Legal 105 1
John 2020 R&D 90 0
Marie 2020 Legal 100 2
Jill 2020 Legal 110 2

Question: How much time (in years) did it take for any data set to achieve a salary >= 100 AFTER the first person in the same department hit that number? (see new column 'hit100after')

I am actually struggeling with the implementation. My steps would be to group by department and find the first occurrence of salary >= 100 and mark that year with a new boolean type or a second dataframe. Then I would calculate the differences between >= 100 of an individual and the first occurrence in its department.

Any idea how to code that for a large dataframe?

like image 616
GGCoolP Avatar asked Feb 20 '26 16:02

GGCoolP


1 Answers

You can groupby() the Department and find the Year.min() where Salary >= 100:

first100 = df.groupby('Department').apply(
    lambda g: g[g.Salary >= 100].Year.min())

# Department
# Legal    2018.0
# R&D         NaN
# dtype: float64

Then map() those first100 years by Department and subtract from Year. The results are in Result for comparison:

df['Result'] = df.Year - df.Department.map(first100)
df.loc[df.Salary < 100, 'Result'] = 0

#     Name  Year Department  Salary  Hit100after  Result
# 0   John  2018        R&D      80            0     0.0
# 1  Marie  2018      Legal      90            0     0.0
# 2   Jill  2018      Legal     100            0     0.0
# 3   John  2019        R&D      85            0     0.0
# 4  Marie  2019      Legal      95            0     0.0
# 5   Jill  2019      Legal     105            1     1.0
# 6   John  2020        R&D      90            0     0.0
# 7  Marie  2020      Legal     100            2     2.0
# 8   Jill  2020      Legal     110            2     2.0

Timings for the current answers

method %timeit with 1 million rows
this answer 3.07 ms ± 23.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
query() answer 4.33 ms ± 354 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
like image 98
tdy Avatar answered Feb 24 '26 19:02

tdy