What is the equivalent of this operation in Pyspark?
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)
output:

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red
The recommended way of doing this in pandas is numpy.where or numpy.select, which are vectorized and much faster than apply. In PySpark, one option is a SQL CASE expression passed to selectExpr.
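As a rough sketch of that approach (the SparkSession setup and the spark_df name here are assumed, not part of the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the same Type/Set data as a Spark DataFrame
spark_df = spark.createDataFrame(
    [('A', 'Z'), ('B', 'Z'), ('B', 'X'), ('C', 'Y')],
    ['Type', 'Set'],
)

# The SQL CASE expression plays the role of np.where
spark_df.selectExpr(
    '*',
    "CASE WHEN Set = 'Z' THEN 'green' ELSE 'red' END AS color",
).show()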
You're looking for pyspark.sql.functions.when():
from pyspark.sql.functions import when, col

# 'green' when Set == 'Z', otherwise 'red' -- the same logic as np.where
df = df.withColumn('color', when(col('Set') == 'Z', 'green').otherwise('red'))
df.show()
#+---+----+-----+
#|Set|Type|color|
#+---+----+-----+
#| Z| A|green|
#| Z| B|green|
#| X| B| red|
#| Y| C| red|
#+---+----+-----+
If you have multiple conditions to check, you can chain together calls to when(), as sketched below.
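For example (the extra rule on the Type column and the 'blue' value are made up purely for illustration):

from pyspark.sql.functions import when, col

# Conditions are evaluated in order; the first match wins,
# and otherwise() supplies the fallback value.
df = df.withColumn(
    'color',
    when(col('Set') == 'Z', 'green')
    .when(col('Type') == 'C', 'blue')   # hypothetical second condition
    .otherwise('red')
)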