
Pyspark: Equivalent of np.where [duplicate]

Tags: pandas, pyspark

What is the equivalent of this operation in Pyspark?

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

Output:

   Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red
adil blanco asked Mar 28 '18 16:03



1 Answer

You're looking for pyspark.sql.functions.when():

from pyspark.sql.functions import when, col

df = df.withColumn('color', when(col('Set') == 'Z', 'green').otherwise('red'))
df.show()
#+---+----+-----+
#|Set|Type|color|
#+---+----+-----+
#|  Z|   A|green|
#|  Z|   B|green|
#|  X|   B|  red|
#|  Y|   C|  red|
#+---+----+-----+

If you have multiple conditions to check, you can chain together calls to when() as shown in this answer.

pault answered Oct 17 '22 20:10