What is the equivalent of this operation in Pyspark?
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)
output:

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red
The recommended way of doing this in pandas is numpy.where or numpy.select, which are vectorized and much faster than apply. In PySpark, one option is a SQL CASE expression passed to selectExpr.
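As a rough sketch of that approach (the SparkSession setup and the spark_df name here are assumed, not part of the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the same Type/Set data as a Spark DataFrame
spark_df = spark.createDataFrame(
    [('A', 'Z'), ('B', 'Z'), ('B', 'X'), ('C', 'Y')],
    ['Type', 'Set'],
)

# The SQL CASE expression plays the role of np.where
spark_df.selectExpr(
    '*',
    "CASE WHEN Set = 'Z' THEN 'green' ELSE 'red' END AS color",
).show()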
You're looking for pyspark.sql.functions.when():
from pyspark.sql.functions import when, col

# 'green' when Set == 'Z', otherwise 'red' -- the same logic as np.where
df = df.withColumn('color', when(col('Set') == 'Z', 'green').otherwise('red'))
df.show()
#+---+----+-----+
#|Set|Type|color|
#+---+----+-----+
#| Z| A|green|
#| Z| B|green|
#| X| B| red|
#| Y| C| red|
#+---+----+-----+
If you have multiple conditions to check, you can chain together calls to when(), as sketched below.
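For example (the extra rule on the Type column and the 'blue' value are made up purely for illustration):

from pyspark.sql.functions import when, col

# Conditions are evaluated in order; the first match wins,
# and otherwise() supplies the fallback value.
df = df.withColumn(
    'color',
    when(col('Set') == 'Z', 'green')
    .when(col('Type') == 'C', 'blue')   # hypothetical second condition
    .otherwise('red')
)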