I have a DataFrame with information about employee salary. It's about 900000+ rows.
Sample:
+----+-------------+---------------+----------+
| | table_num | name | salary |
|----+-------------+---------------+----------|
| 0 | 001234 | John Johnson | 1200 |
| 1 | 001234 | John Johnson | 1000 |
| 2 | 001235 | John Johnson | 1000 |
| 3 | 001235 | John Johnson | 1200 |
| 4 | 001235 | John Johnson | 1000 |
| 5 | 001235 | Steve Stevens | 1000 |
| 6 | 001236 | Steve Stevens | 1200 |
| 7 | 001236 | Steve Stevens | 1200 |
| 8 | 001236 | Steve Stevens | 1200 |
+----+-------------+---------------+----------+
dtypes:
table_num: string
name: string
salary: float
I need to add a column with information about increased\decreased salary level.
I'm using the shift() function to compare value in rows.
Main problem is in filtration and iteration over all unique employees over the whole dataset.
It takes about 3 and half hour in my script.
How to do it faster?
My script:
# giving us only unique combination of 'table_num' and 'name'
# since there can be same 'table_num' for different 'name'
# and same names with different 'table_num' appears sometimes
names_df = df[['table_num', 'name']].drop_duplicates()
# then extracting particular name and table_num from Series
for i in range(len(names_df)): ### Bottleneck of whole script ###
t = names_df.iloc[i,[0,1]][0]
n = names_df.iloc[i,[0,1]][1]
# using shift() and lambda to check if there difference between two rows
diff_sal = (df[(df['table_num']==t)
& ((df['name']==n))]['salary'] - df[(df['table_num']==t)
& ((df['name']==n))]['salary'].shift(1)).apply(lambda x: 1 if x>0 else (-1 if x<0 else 0))
df.loc[diff_sal.index, 'inc'] = diff_sal.values
Sample input data:
df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})
Sample output:
+----+-------------+---------------+----------+-------+
| | table_num | name | salary | inc |
|----+-------------+---------------+----------+-------|
| 0 | 001234 | John Johnson | 1200 | 0 |
| 1 | 001234 | John Johnson | 1000 | -1 |
| 2 | 001235 | John Johnson | 1000 | 0 |
| 3 | 001235 | John Johnson | 1200 | 1 |
| 4 | 001235 | John Johnson | 1000 | -1 |
| 5 | 001235 | Steve Stevens | 1000 | 0 |
| 6 | 001236 | Steve Stevens | 1200 | 0 |
| 7 | 001236 | Steve Stevens | 1200 | 0 |
| 8 | 001236 | Steve Stevens | 1200 | 0 |
+----+-------------+---------------+----------+-------+
Use groupby together with diff:
df['inc'] = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
df.loc[df['inc'] > 0.0, 'inc'] = 1.0
df.loc[df['inc'] < 0.0, 'inc'] = -1.0
Use DataFrameGroupBy.diff with numpy.sign and last cast to integers:
df['new'] = np.sign(df.groupby(['table_num', 'name'])['salary'].diff().fillna(0)).astype(int)
print (df)
table_num name salary new
0 1234 John Johnson 1200 0
1 1234 John Johnson 1000 -1
2 1235 John Johnson 1000 0
3 1235 John Johnson 1200 1
4 1235 John Johnson 1000 -1
5 1235 Steve Stevens 1000 0
6 1236 Steve Stevens 1200 0
7 1236 Steve Stevens 1200 0
8 1236 Steve Stevens 1200 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With