 

PySpark: modify column values when another column value satisfies a condition

I have a PySpark DataFrame with two columns:

+---+----+
| Id|Rank|
+---+----+
|  a|   5|
|  b|   7|
|  c|   8|
|  d|   1|
+---+----+

For each row, I'm looking to replace the Id column value with "other" if the Rank column value is larger than 5.

If I use pseudocode to explain:

for row in df:
    if row.Rank > 5:
        replace(row.Id, "other")

The result should look like this:

+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+

Any clue how to achieve this? Thanks!!!


To create this DataFrame:

df = spark.createDataFrame([('a', 5), ('b', 7), ('c', 8), ('d', 1)], ['Id', 'Rank']) 
asked May 15 '17 by Yuehan Lyu


People also ask

How do you replace values in a column based on condition in PySpark?

You can replace column values of a PySpark DataFrame using the SQL string functions regexp_replace(), translate(), and overlay().
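For example, a minimal sketch against the question's df; the pattern 'a' and the replacement 'z' are purely illustrative:

from pyspark.sql.functions import regexp_replace

# Replace every occurrence of the (illustrative) pattern 'a' in Id with 'z'
df.withColumn('Id', regexp_replace('Id', 'a', 'z')).show()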

How do you update column values in PySpark?

You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Since DataFrames are distributed, immutable collections, you can't really change column values in place; when you "change" a value with withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.
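A minimal sketch with the question's df, showing that the original DataFrame is left untouched:

# withColumn() returns a new DataFrame; df itself is unchanged
df2 = df.withColumn('Rank', df.Rank + 1)
df2.show()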

How do I assign a value to a column in a DataFrame in PySpark?

Method 1: Using the lit() function. Select the table with the select() method, passing the column name (or "*" to select the whole table) as the first argument and lit() with a constant value as the second.
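A minimal sketch with the question's df; the constant 1 and the column name 'Flag' are illustrative:

from pyspark.sql.functions import lit

# '*' keeps all existing columns; lit(1) appends a constant column
df.select('*', lit(1).alias('Flag')).show()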

How do you rename column values in PySpark?

Method 1: Using withColumnRenamed(). Use the withColumnRenamed() method to change the column names of a PySpark DataFrame. existing: str, the existing column name to rename. new: str, the new column name. Returns a new DataFrame with the column renamed.
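For example, with the question's df; the new name 'Score' is illustrative:

# Returns a new DataFrame with Rank renamed to Score
df.withColumnRenamed('Rank', 'Score').show()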


1 Answer

You can use when and otherwise like this:

from pyspark.sql.functions import col, when

df \
    .withColumn('Id_New', when(df.Rank <= 5, df.Id).otherwise('other')) \
    .drop(df.Id) \
    .select(col('Id_New').alias('Id'), col('Rank')) \
    .show()

This gives the output:

+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+
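Equivalently, you could overwrite Id in place and skip the helper column entirely; this is a variant sketch, not part of the original answer:

from pyspark.sql.functions import when

# when() picks 'other' for rows with Rank > 5, otherwise keeps the existing Id
df.withColumn('Id', when(df.Rank > 5, 'other').otherwise(df.Id)).show()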
answered Sep 22 '22 by Pushkr