Calculating difference between first and individual occurence (Python Pandas)

Question

Let's take a look at a simple HR dataframe as an example:

Name	Year	Department	Salary	New column: hit100after
John	2018	R&D	80	0
Marie	2018	Legal	90	0
Jill	2018	Legal	100	0
John	2019	R&D	85	0
Marie	2019	Legal	95	0
Jill	2019	Legal	105	1
John	2020	R&D	90	0
Marie	2020	Legal	100	2
Jill	2020	Legal	110	2

Question: How much time (in years) did it take for any data set to achieve a salary >= 100 AFTER the first person in the same department hit that number? (see new column 'hit100after')

I am actually struggeling with the implementation. My steps would be to group by department and find the first occurrence of salary >= 100 and mark that year with a new boolean type or a second dataframe. Then I would calculate the differences between >= 100 of an individual and the first occurrence in its department.

Any idea how to code that for a large dataframe?

tdy · Accepted Answer

You can groupby() the Department and find the Year.min() where Salary >= 100:

first100 = df.groupby('Department').apply(
    lambda g: g[g.Salary >= 100].Year.min())

# Department
# Legal    2018.0
# R&D         NaN
# dtype: float64

Then map() those first100 years by Department and subtract from Year. The results are in Result for comparison:

df['Result'] = df.Year - df.Department.map(first100)
df.loc[df.Salary < 100, 'Result'] = 0

#     Name  Year Department  Salary  Hit100after  Result
# 0   John  2018        R&D      80            0     0.0
# 1  Marie  2018      Legal      90            0     0.0
# 2   Jill  2018      Legal     100            0     0.0
# 3   John  2019        R&D      85            0     0.0
# 4  Marie  2019      Legal      95            0     0.0
# 5   Jill  2019      Legal     105            1     1.0
# 6   John  2020        R&D      90            0     0.0
# 7  Marie  2020      Legal     100            2     2.0
# 8   Jill  2020      Legal     110            2     2.0

Timings for the current answers

method	`%timeit` with 1 million rows
this answer	3.07 ms ± 23.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
`query()` answer	4.33 ms ± 354 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Calculating difference between first and individual occurence (Python Pandas)

Tags:

python

python-3.x

pandas

data-science

GGCoolP

1 Answers

Timings for the current answers

tdy

Recent Activity

Donate For Us

Calculating difference between first and individual occurence (Python Pandas)

Tags:

python

python-3.x

pandas

data-science

GGCoolP

1 Answers

Timings for the current answers

tdy

Related questions

Recent Activity

Donate For Us