Looking at PySpark, I see translate and regexp_replace, which help me replace a single character that exists in a DataFrame column. I was wondering if there is a way to supply multiple strings to regexp_replace or translate so that it would parse them and replace them with something else.

Use case: remove all $, #, and comma (,) in a column A.
The Spark SQL function regexp_replace can be used to remove special characters from a string column in a Spark DataFrame; the exact regular expression depends on which characters you consider special. With PySpark's regexp_replace() you can replace every substring of a column that matches a pattern with another string.
You can use pyspark.sql.functions.translate()
to make multiple replacements. Pass in a string of letters to replace and another string of equal length which represents the replacement values.
For example, let's say you had the following DataFrame:
import pyspark.sql.functions as f

# Create a sample DataFrame via the SparkSession entry point
df = spark.createDataFrame([("$100,00",), ("#foobar",), ("foo, bar, #, and $",)], ["A"])
df.show()
#+------------------+
#|                 A|
#+------------------+
#|           $100,00|
#|           #foobar|
#|foo, bar, #, and $|
#+------------------+
and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z'). Simply use translate like:
df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#|                 A|          replaced|
#+------------------+------------------+
#|           $100,00|           X100Z00|
#|           #foobar|           Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+
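The character-for-character mapping that translate performs can be sketched with plain Python's str.translate, which behaves analogously for this case (a minimal stdlib illustration, not using Spark):

```python
# Build a per-character mapping: $ -> X, # -> Y, , -> Z
mapping = str.maketrans("$#,", "XYZ")

print("$100,00".translate(mapping))             # X100Z00
print("foo, bar, #, and $".translate(mapping))  # fooZ barZ YZ and X
```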
If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().
df.select("A", f.regexp_replace(f.col("A"), r"[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#|                 A|     replaced|
#+------------------+-------------+
#|           $100,00|        10000|
#|           #foobar|       foobar|
#|foo, bar, #, and $|foo bar  and |
#+------------------+-------------+
The pattern r"[\$#,]" means match any one of the characters inside the brackets. The $ is escaped because it has a special meaning in regex (inside a character class it would actually be treated literally, so the escape is just a precaution), and the raw string (r"...") keeps Python from interpreting the backslash itself.
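You can check the character class quickly with Python's re module, which uses the same bracketed-class syntax (a standalone sketch, independent of Spark; repr() is used so the remaining spaces are visible):

```python
import re

# [\$#,] matches any single one of: $, #, ,
pattern = re.compile(r"[\$#,]")

print(pattern.sub("", "$100,00"))                    # 10000
print(repr(pattern.sub("", "foo, bar, #, and $")))   # 'foo bar  and '
```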
If you need to do this in Scala, you can use the code below:

import org.apache.spark.sql.functions._

val df = Seq(("Test$",19),("$#,",23),("Y#a",20),("ZZZ,,",21)).toDF("Name","age")
val df1 = df.withColumn("NewName", translate($"Name", "$#,", "xyz"))
df1.show()

You can see the output as below:

// +-----+---+-------+
// | Name|age|NewName|
// +-----+---+-------+
// |Test$| 19|  Testx|
// |  $#,| 23|    xyz|
// |  Y#a| 20|    Yya|
// |ZZZ,,| 21|  ZZZzz|
// +-----+---+-------+