
How do I replace a string value with a NULL in PySpark?

I want to do something like this:

df.replace('empty-value', None, 'NAME') 

Basically, I want to replace some value with NULL, but this function does not accept None. How can I do this?

talloaktrees asked Apr 27 '16 18:04


People also ask

How do you give NULL values in PySpark?

To replace an empty value with None/null in a single DataFrame column, you can use withColumn() together with when().otherwise().

How do I change a value to NULL in Spark DataFrame?

Replacing null values in PySpark DataFrames is one of the most common operations. It can be done with either DataFrame.fillna() or DataFrameNaFunctions.fill(). Note that these methods replace existing nulls with a value; they do not set a value to NULL.

How do I change string values in PySpark?

With the PySpark SQL function regexp_replace() you can replace a string or substring in a column value with another string. regexp_replace() uses Java regex for matching; if the regex does not match, the value is returned unchanged. For example, it can replace the street abbreviation Rd with Road in an address column.


2 Answers

You can combine a when clause with a NULL literal, adding a type cast if needed, as follows:

from pyspark.sql.functions import when, lit, col

df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])

def replace(column, value):
    return when(column != value, column).otherwise(lit(None))

df.withColumn("y", replace(col("y"), "bar")).show()
## +---+----+
## |  x|   y|
## +---+----+
## |  1| foo|
## |  2|null|
## +---+----+

Because it doesn't introduce a BatchPythonEvaluation step, it should be significantly more efficient than using a UDF.

zero323 answered Sep 17 '22 17:09


This will replace empty-value with None in your name column:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = sc.parallelize([(1, "empty-value"), (2, "something else")]).toDF(["key", "name"])
new_column_udf = udf(lambda name: None if name == "empty-value" else name, StringType())
new_df = df.withColumn("name", new_column_udf(df.name))
new_df.collect()

Output:

[Row(key=1, name=None), Row(key=2, name=u'something else')] 

Passing the existing column name as the first argument to withColumn replaces the old name column with the new one generated by the UDF.

Daniel Zolnai answered Sep 19 '22 17:09