 

PySpark: modify column values when another column value satisfies a condition

I have a PySpark DataFrame with two columns:

+---+----+
| Id|Rank|
+---+----+
|  a|   5|
|  b|   7|
|  c|   8|
|  d|   1|
+---+----+

For each row, I'm looking to replace the Id column value with "other" if the Rank column value is larger than 5.

If I use pseudocode to explain:

for row in df:
    if row.Rank > 5:
        replace(row.Id, "other")

The result should look like this:

+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+

Any clue how to achieve this? Thanks!!!


To create this DataFrame:

df = spark.createDataFrame([('a', 5), ('b', 7), ('c', 8), ('d', 1)], ['Id', 'Rank']) 
asked May 15 '17 by Yuehan Lyu


People also ask

How do you replace values in a column based on condition in PySpark?

You can replace column values of a PySpark DataFrame using the SQL string functions regexp_replace(), translate(), and overlay().
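For example, a minimal sketch against the question's df; the pattern 'a' and the replacement 'z' are purely illustrative:

from pyspark.sql.functions import regexp_replace

# Replace every occurrence of the (illustrative) pattern 'a' in Id with 'z'
df.withColumn('Id', regexp_replace('Id', 'a', 'z')).show()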

How do you update column values in PySpark?

You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Since DataFrames are distributed, immutable collections, you can't really change column values in place; when you "change" a value with withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.
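A minimal sketch with the question's df, showing that the original DataFrame is left untouched:

# withColumn() returns a new DataFrame; df itself is unchanged
df2 = df.withColumn('Rank', df.Rank + 1)
df2.show()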

How do I assign a value to a column in a DataFrame in PySpark?

Method 1: Using the lit() function. Select the table with the select() method, passing the column name (or "*" to select the whole table) as the first argument and lit() with a constant value as the second.
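A minimal sketch with the question's df; the constant 1 and the column name 'Flag' are illustrative:

from pyspark.sql.functions import lit

# '*' keeps all existing columns; lit(1) appends a constant column
df.select('*', lit(1).alias('Flag')).show()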

How do you rename column values in PySpark?

Method 1: Using withColumnRenamed(). Use the withColumnRenamed() method to change the column names of a PySpark DataFrame. existing: str, the existing column name to rename. new: str, the new column name. Returns a new DataFrame with the column renamed.
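For example, with the question's df; the new name 'Score' is illustrative:

# Returns a new DataFrame with Rank renamed to Score
df.withColumnRenamed('Rank', 'Score').show()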


1 Answer

You can use when and otherwise like this:

from pyspark.sql.functions import col, when

df \
    .withColumn('Id_New', when(df.Rank <= 5, df.Id).otherwise('other')) \
    .drop(df.Id) \
    .select(col('Id_New').alias('Id'), col('Rank')) \
    .show()

This gives the output:

+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+
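Equivalently, you could overwrite Id in place and skip the helper column entirely; this is a variant sketch, not part of the original answer:

from pyspark.sql.functions import when

# when() picks 'other' for rows with Rank > 5, otherwise keeps the existing Id
df.withColumn('Id', when(df.Rank > 5, 'other').otherwise(df.Id)).show()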
answered Sep 22 '22 by Pushkr