Pyspark removing multiple characters in a dataframe column

Looking at PySpark, I see translate and regexp_replace, which help me replace a single character that exists in a DataFrame column.

I was wondering if there is a way to supply multiple strings to regexp_replace or translate so that they would all be parsed and replaced with something else.

Use case: remove all $, #, and commas (,) in a column A.

asked Jun 08 '18 by E B

People also ask

How do I remove a character from a column in PySpark?

The Spark SQL function regexp_replace can be used to remove special characters from a string column in a Spark DataFrame. Depending on the definition of special characters, the regular expression will vary.

How do you replace a character in a column in PySpark?

By using the PySpark SQL function regexp_replace(), you can replace a substring in a column value with another string.
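A minimal sketch (the column name and data here are just made-up examples, assuming an active SparkSession named spark):

import pyspark.sql.functions as f
df = spark.createDataFrame([("foo-bar",)], ["name"])
df.withColumn("name", f.regexp_replace("name", "-", "_")).show()
#+-------+
#|   name|
#+-------+
#|foo_bar|
#+-------+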

What does trim function do in PySpark?

The trim() function 'trims' spaces before and after a column's string values. There are variations of this function: ltrim() removes spaces on the left side of the string, and rtrim() removes spaces on the right side of it.
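A quick sketch of all three (hypothetical column and data, assuming a SparkSession named spark):

import pyspark.sql.functions as f
df = spark.createDataFrame([("  hello  ",)], ["A"])
df.select(
    f.trim("A").alias("trim"),    # "hello"
    f.ltrim("A").alias("ltrim"),  # "hello  "
    f.rtrim("A").alias("rtrim"),  # "  hello"
).show()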


2 Answers

You can use pyspark.sql.functions.translate() to make multiple replacements. Pass in a string of letters to replace and another string of equal length which represents the replacement values.

For example, let's say you had the following DataFrame:

import pyspark.sql.functions as f
# assumes an active SparkSession named spark
df = spark.createDataFrame([("$100,00",), ("#foobar",), ("foo, bar, #, and $",)], ["A"])
df.show()
#+------------------+
#|                 A|
#+------------------+
#|           $100,00|
#|           #foobar|
#|foo, bar, #, and $|
#+------------------+

and you wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z') respectively. Simply use translate like this:

df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#|                 A|          replaced|
#+------------------+------------------+
#|           $100,00|           X100Z00|
#|           #foobar|           Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+

If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().

df.select("A", f.regexp_replace(f.col("A"), "[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#|                 A|     replaced|
#+------------------+-------------+
#|           $100,00|        10000|
#|           #foobar|       foobar|
#|foo, bar, #, and $|foo bar  and |
#+------------------+-------------+

The pattern r"[\$#,]" is a character class: it matches any one of the characters inside the brackets. The $ is escaped because it has a special meaning in regex (inside a character class the escape is technically optional, but harmless), and the raw-string prefix r keeps Python from interpreting the backslash itself.
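As an aside, translate() itself can also delete characters: when the replacement string is shorter than the matching string, the leftover characters are simply dropped (this follows Hive's translate semantics, which Spark adopts; worth double-checking on your version). So passing an empty replacement should give the same result as the regexp_replace above:

df.select("A", f.translate(f.col("A"), "$#,", "").alias("replaced")).show()
#+------------------+-------------+
#|                 A|     replaced|
#+------------------+-------------+
#|           $100,00|        10000|
#|           #foobar|       foobar|
#|foo, bar, #, and $|foo bar  and |
#+------------------+-------------+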

answered Oct 21 '22 by pault


If you need to do this in Scala, you can use the code below:

import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and the $"col" syntax

val df = Seq(("Test$", 19), ("$#,", 23), ("Y#a", 20), ("ZZZ,,", 21)).toDF("Name", "age")
val df1 = df.withColumn("NewName", translate($"Name", "$#,", "xyz"))
df1.show()  // or display(df1) on Databricks

You can see the output below:

+-----+---+-------+
| Name|age|NewName|
+-----+---+-------+
|Test$| 19|  Testx|
|  $#,| 23|    xyz|
|  Y#a| 20|    Yya|
|ZZZ,,| 21|  ZZZzz|
+-----+---+-------+

answered Oct 21 '22 by Nikunj Kakadiya