Looking at PySpark, I see translate and regexp_replace, which help me replace a single character that exists in a DataFrame column. I was wondering if there is a way to supply multiple strings to regexp_replace or translate so that it would parse them and replace them with something else.

Use case: remove all $, #, and comma (,) in a column A.
The Spark SQL function regexp_replace can be used to remove special characters from a string column in a Spark DataFrame; the exact regular expression depends on which characters you consider special. With PySpark's regexp_replace() you can replace every substring of a column that matches a pattern with another string.
You can use pyspark.sql.functions.translate()
to make multiple replacements. Pass in a string of letters to replace and another string of equal length which represents the replacement values.
For example, let's say you had the following DataFrame:
import pyspark.sql.functions as f

# Create a sample DataFrame via the SparkSession entry point
df = spark.createDataFrame([("$100,00",), ("#foobar",), ("foo, bar, #, and $",)], ["A"])
df.show()
#+------------------+
#|                 A|
#+------------------+
#|           $100,00|
#|           #foobar|
#|foo, bar, #, and $|
#+------------------+
and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z'). Simply use translate like:
df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#|                 A|          replaced|
#+------------------+------------------+
#|           $100,00|           X100Z00|
#|           #foobar|           Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+
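The character-for-character mapping that translate performs can be sketched with plain Python's str.translate, which behaves analogously for this case (a minimal stdlib illustration, not using Spark):

```python
# Build a per-character mapping: $ -> X, # -> Y, , -> Z
mapping = str.maketrans("$#,", "XYZ")

print("$100,00".translate(mapping))             # X100Z00
print("foo, bar, #, and $".translate(mapping))  # fooZ barZ YZ and X
```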
If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().
df.select("A", f.regexp_replace(f.col("A"), r"[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#|                 A|     replaced|
#+------------------+-------------+
#|           $100,00|        10000|
#|           #foobar|       foobar|
#|foo, bar, #, and $|foo bar  and |
#+------------------+-------------+
The pattern r"[\$#,]" means match any one of the characters inside the brackets. The $ is escaped because it has a special meaning in regex (inside a character class it would actually be treated literally, so the escape is just a precaution), and the raw string (r"...") keeps Python from interpreting the backslash itself.
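You can check the character class quickly with Python's re module, which uses the same bracketed-class syntax (a standalone sketch, independent of Spark; repr() is used so the remaining spaces are visible):

```python
import re

# [\$#,] matches any single one of: $, #, ,
pattern = re.compile(r"[\$#,]")

print(pattern.sub("", "$100,00"))                    # 10000
print(repr(pattern.sub("", "foo, bar, #, and $")))   # 'foo bar  and '
```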
If you need to do this in Scala, you can use the code below:

import org.apache.spark.sql.functions._

val df = Seq(("Test$",19),("$#,",23),("Y#a",20),("ZZZ,,",21)).toDF("Name","age")
val df1 = df.withColumn("NewName", translate($"Name", "$#,", "xyz"))
df1.show()

You can see the output as below:

// +-----+---+-------+
// | Name|age|NewName|
// +-----+---+-------+
// |Test$| 19|  Testx|
// |  $#,| 23|    xyz|
// |  Y#a| 20|    Yya|
// |ZZZ,,| 21|  ZZZzz|
// +-----+---+-------+