How do I replace a string value with a NULL in PySpark?

I want to do something like this:

df.replace('empty-value', None, 'NAME')

Basically, I want to replace some value with NULL, but this function does not accept None. How can I do this?

391

asked Apr 27 '16 18:04

talloaktrees

2 Answers

You can combine a when clause with a NULL literal and type casting as follows:

    from pyspark.sql.functions import when, lit, col

    df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])

    def replace(column, value):
        return when(column != value, column).otherwise(lit(None))

    df.withColumn("y", replace(col("y"), "bar")).show()
    ## +---+----+
    ## |  x|   y|
    ## +---+----+
    ## |  1| foo|
    ## |  2|null|
    ## +---+----+

Because it doesn't introduce BatchPythonEvaluation, it should be significantly more efficient than using a UDF.

191

answered Sep 17 '22 17:09

zero323

This will replace empty-value with None in your name column:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    df = sc.parallelize([(1, "empty-value"), (2, "something else")]).toDF(["key", "name"])
    new_column_udf = udf(lambda name: None if name == "empty-value" else name, StringType())
    new_df = df.withColumn("name", new_column_udf(df.name))
    new_df.collect()

Output:

[Row(key=1, name=None), Row(key=2, name=u'something else')]

Because the existing column name is passed as the first argument to withColumn, the old name column is replaced with the new one generated by the UDF.

answered Sep 19 '22 17:09

Daniel Zolnai



Tags: null, dataframe, apache-spark, pyspark
