I have a PySpark DataFrame with two columns:
+---+----+
| Id|Rank|
+---+----+
|  a|   5|
|  b|   7|
|  c|   8|
|  d|   1|
+---+----+
For each row, I'm looking to replace the Id column with "other" if the Rank column is larger than 5.
In pseudocode:

for row in df:
    if row.Rank > 5:
        replace(row.Id, "other")
The result should look like this:
+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+
Any clue how to achieve this? Thanks!!!
To create this DataFrame:
df = spark.createDataFrame([('a', 5), ('b', 7), ('c', 8), ('d', 1)], ['Id', 'Rank'])
You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Since DataFrames are distributed, immutable collections, you can't change column values in place; instead, each of these approaches returns a new DataFrame with the updated values.
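For example (a minimal sketch, reusing the df defined above), withColumn() leaves the original DataFrame untouched:

from pyspark.sql.functions import lit

# withColumn() returns a new DataFrame; df itself is not modified
df2 = df.withColumn('Rank', lit(0))
df.show()   # still shows the original Rank values
df2.show()  # shows Rank = 0 for every row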
You can use when() and otherwise(), like this:
from pyspark.sql.functions import col, when

# Keep Id when Rank <= 5, otherwise replace it with 'other',
# then rename the new column back to Id
df.withColumn('Id_New', when(df.Rank <= 5, df.Id).otherwise('other')) \
  .drop(df.Id) \
  .select(col('Id_New').alias('Id'), col('Rank')) \
  .show()
This gives the output:
+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+
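As a side note, the same result can be written more compactly (a sketch using the same when()/otherwise() approach): passing an existing column name to withColumn() overwrites that column, so the drop()/select() renaming step isn't needed:

from pyspark.sql.functions import when

# Overwrite Id in place: 'other' when Rank > 5, otherwise keep the original value
df.withColumn('Id', when(df.Rank > 5, 'other').otherwise(df.Id)).show()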