
How to conditionally replace value in a column based on evaluation of expression based on another column in Pyspark?

import numpy as np

df = spark.createDataFrame(
    [(1, 1, None),
     (1, 2, float(5)),
     (1, 3, np.nan),
     (1, 4, None),
     (0, 5, float(10)),
     (1, 6, float('nan')),
     (0, 6, float('nan'))],
    ('session', "timestamp1", "id2"))
+-------+----------+----+
|session|timestamp1| id2|
+-------+----------+----+
|      1|         1|null|
|      1|         2| 5.0|
|      1|         3| NaN|
|      1|         4|null|
|      0|         5|10.0|
|      1|         6| NaN|
|      0|         6| NaN|
+-------+----------+----+

How can I replace the value of the timestamp1 column with 999 when session == 0?

Expected output

+-------+----------+----+
|session|timestamp1| id2|
+-------+----------+----+
|      1|         1|null|
|      1|         2| 5.0|
|      1|         3| NaN|
|      1|         4|null|
|      0|       999|10.0|
|      1|         6| NaN|
|      0|       999| NaN|
+-------+----------+----+

Is it possible to do it using replace() in PySpark?

asked Jun 27 '17 by GeorgeOfTheRF

People also ask

How do you replace values in a column based on condition in PySpark?

You can replace column values of a PySpark DataFrame using the SQL string functions regexp_replace(), translate(), and overlay().

How do I change the value of an existing column in PySpark?

You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Since DataFrames are distributed, immutable collections, you can't really change column values in place; when you "change" a value with withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.

How do you use coalesce in PySpark?

DataFrame.coalesce(n) merges existing partitions without a full shuffle; you can check the result by converting to the underlying RDD and reading its number of partitions with getNumPartitions(). Note that coalesce() can only reduce the partition count, never increase it beyond the current number; to increase partitions, use repartition() instead.

How do you rename column values in PySpark?

Method 1: Using withColumnRenamed(). Use the withColumnRenamed(existing, new) method to change a column name of a PySpark DataFrame: existing is the current column name, new is the replacement name. It returns a new DataFrame with the column renamed.


1 Answer

You should be using the when (with otherwise) function:

from pyspark.sql.functions import when

targetDf = df.withColumn(
    "timestamp1",
    when(df["session"] == 0, 999).otherwise(df["timestamp1"]))
answered Oct 13 '22 by Assaf Mendelson