
How to define a user-defined function in pandas

Tags: python, pandas

I have a CSV file that contains information like this:

name    salary    department
a       2500      x
b       5000      y
c       10000     y
d       20000     x

I need to convert this with pandas into a form like this:

dept    name    position
x       a       Normal Employee
x       d       Experienced Employee
y       b       Normal Employee
y       c       Experienced Employee

If the salary is <= 8000, the position is Normal Employee.

If the salary is > 8000 and <= 25000, the position is Experienced Employee.

My current code for the group by:

import pandas

pandas.set_option('display.max_rows', 999)
data_df = pandas.read_csv('employeedetails.csv')
# print(data_df.columns)
# Group on the column name used in the CSV ('department', not 'dept')
t = data_df.groupby(['department'])
print(t)

What changes do I need to make to this code to get the output shown above?

Edwin Baby asked Feb 15 '16 16:02



2 Answers

I would use a simple function like:

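def f(x):
    # Map a single salary value to a position label
    if x <= 8000:
        return 'Normal Employee'
    elif x <= 25000:
        return 'Experienced Employee'
    # Salaries above 25000 are not covered by the rules; return them unchanged
    return x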

and then apply it to the df:

df['position'] = df['salary'].apply(f)
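
To get from that new column to the dept / name / position layout asked for in the question, something like the following sketch might work. It assumes the four sample rows from the question are built in-line instead of being read from employeedetails.csv, and the rename, column selection, and sort steps are my own additions, not part of the answer above.

```python
import pandas as pd

# Sample rows from the question, built in-line (assumption: the real
# data would come from pandas.read_csv('employeedetails.csv')).
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'salary': [2500, 5000, 10000, 20000],
    'department': ['x', 'y', 'y', 'x'],
})


def f(x):
    # Same thresholds as the function f above.
    if x <= 8000:
        return 'Normal Employee'
    elif x <= 25000:
        return 'Experienced Employee'
    return x


# Apply the function to every salary value.
df['position'] = df['salary'].apply(f)

# Rename, reorder, and sort to match the requested layout.
result = (
    df.rename(columns={'department': 'dept'})
      .loc[:, ['dept', 'name', 'position']]
      .sort_values(['dept', 'name'])
      .reset_index(drop=True)
)
print(result)
```

No groupby is needed here, since the position depends only on each row's own salary; sorting by dept is enough to list the rows department by department.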
Fabio Lamanna answered Nov 14 '22 21:11


You could define 2 masks and pass these to np.where:

In [91]:
import numpy as np
normal = df['salary'] <= 8000
experienced = (df['salary'] > 8000) & (df['salary'] <= 25000)
df['position'] = np.where(normal, 'normal employee', np.where(experienced, 'experienced employee', 'unknown'))
df

Out[91]:
  name  salary department              position
0    a    2500          x       normal employee
1    b    5000          y       normal employee
2    c   10000          y  experienced employee
3    d   20000          x  experienced employee

Or, slightly more readably, pass the masks to loc:

In [92]:
df.loc[normal, 'position'] = 'normal employee'
df.loc[experienced,'position'] = 'experienced employee'
df

Out[92]:
  name  salary department              position
0    a    2500          x       normal employee
1    b    5000          y       normal employee
2    c   10000          y  experienced employee
3    d   20000          x  experienced employee
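
If more salary bands are added later, the masks can get repetitive. As a hedged alternative (a different technique from the two shown above), pandas.cut can assign labels from a list of bin edges. This sketch assumes the same sample data as the question, built in-line, and uses a lower edge of negative infinity so that every salary up to 8000 lands in the first bin.

```python
import pandas as pd

# Sample rows from the question, built in-line for illustration.
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'salary': [2500, 5000, 10000, 20000],
    'department': ['x', 'y', 'y', 'x'],
})

# Bins are right-inclusive by default:
#   (-inf, 8000]  -> normal employee
#   (8000, 25000] -> experienced employee
df['position'] = pd.cut(
    df['salary'],
    bins=[float('-inf'), 8000, 25000],
    labels=['normal employee', 'experienced employee'],
)
print(df)
```

Note that the resulting column has a categorical dtype, and salaries above 25000 come out as NaN here instead of 'unknown'.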
EdChum answered Nov 14 '22 21:11