I have a CSV file that contains information like this:
name    salary    department
a       2500      x
b       5000      y
c       10000     y
d       20000     x
I need to convert this, using pandas, into a form like this:
dept    name    position
x        a       Normal Employee
x        b       Normal Employee
y        c       Experienced Employee
y        d       Experienced Employee
If the salary is <= 8000, the position is Normal Employee.
If the salary is > 8000 and <= 25000, the position is Experienced Employee.
My current code for the group by:
import csv
import pandas
pandas.set_option('display.max_rows', 999)
data_df = pandas.read_csv('employeedetails.csv')
#print(data_df.columns)
t = data_df.groupby(['dept'])
print t
What changes do I need to make to this code to get the output shown above?
I would use a simple function like:
def f(x):
    if x <= 8000:
        x = 'Normal Employee'
    elif 8000 < x <= 25000:
        x = 'Experienced Employee'
    return x
and then apply it to the df:
df['position'] = df['salary'].apply(f)
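For anyone who wants to run this end to end, here is a minimal sketch of the same apply approach. It inlines the sample rows from the question instead of reading employeedetails.csv, and the names position_for and result are just illustrative. The 'Unknown' fallback for salaries above 25000 is my assumption, since the question does not say what should happen there.

```python
import io

import pandas as pd

# Sample rows from the question, inlined so the sketch runs without the CSV file.
csv_text = """name,salary,department
a,2500,x
b,5000,y
c,10000,y
d,20000,x
"""
df = pd.read_csv(io.StringIO(csv_text))


def position_for(salary):
    """Map a salary to a position label using the thresholds from the question."""
    if salary <= 8000:
        return 'Normal Employee'
    if salary <= 25000:
        return 'Experienced Employee'
    # Assumption: the question does not define a label above 25000.
    return 'Unknown'


df['position'] = df['salary'].apply(position_for)

# Rename and reorder the columns to match the requested layout,
# then sort so that rows of the same department sit together.
result = (
    df.rename(columns={'department': 'dept'})
      .loc[:, ['dept', 'name', 'position']]
      .sort_values(['dept', 'name'])
      .reset_index(drop=True)
)
print(result)
```

Note that no groupby is needed for this: the label depends only on each row's salary, so sorting by department is enough to get the grouped layout. With the sample input, d (salary 20000, department x) ends up as an Experienced Employee under x, so the result will not match the expected table in the question row for row; that table does not seem to line up with the sample input.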
You could define two masks and pass them to np.where:
In [91]:
import numpy as np
normal = df['salary'] <= 8000
experienced = (df['salary'] > 8000) & (df['salary'] <= 25000)
df['position'] = np.where(normal, 'normal employee', np.where(experienced, 'experienced employee', 'unknown'))
df
Out[91]:
  name  salary department              position
0    a    2500          x       normal employee
1    b    5000          y       normal employee
2    c   10000          y  experienced employee
3    d   20000          x  experienced employee
Or, slightly more readably, pass the masks to loc:
In [92]:
df.loc[normal, 'position'] = 'normal employee'
df.loc[experienced,'position'] = 'experienced employee'
df
Out[92]:
  name  salary department              position
0    a    2500          x       normal employee
1    b    5000          y       normal employee
2    c   10000          y  experienced employee
3    d   20000          x  experienced employee
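If the number of salary bands grows, writing one mask per band gets tedious. A different technique from the two above is pd.cut, which bins a numeric column by a list of edges. This is only a sketch: the DataFrame is rebuilt from the question's sample rows, and the bins and labels names are my own.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'salary': [2500, 5000, 10000, 20000],
    'department': ['x', 'y', 'y', 'x'],
})

# Bins are right-inclusive by default: (-inf, 8000] and (8000, 25000].
# A salary above 25000 falls outside every bin and becomes NaN.
bins = [float('-inf'), 8000, 25000]
labels = ['normal employee', 'experienced employee']
df['position'] = pd.cut(df['salary'], bins=bins, labels=labels)
print(df)
```

The resulting column has a categorical dtype rather than plain strings; use df['position'].astype(str) if you need ordinary string values.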