How to remove special characters and unicode emojis in PySpark?

Good afternoon everyone. I have a problem clearing special characters from a string column of a dataframe. I just want to remove special characters like HTML components, emojis, and unicode escapes, for example \u2013.

Does anyone have a regular expression that would help? Or any suggestions on how to handle this problem?

input:

i want to remove 😃 and codes "\u2022"

expected output:

i want to remove and codes

I tried:

re.sub('[^A-Za-z0-9 \u2022]+', '', nome)

regexp_replace('nome', '\r\n|/[\x00-\x1F\x7F]/u', ' ')

df = df.withColumn( "value_2", F.regexp_replace(F.regexp_replace("value", "[^\x00-\x7F]+", ""), '""', '') )

df = df.withColumn("new",df.text.encode('ascii', errors='ignore').decode('ascii'))

I tried these solutions, but none of them recognizes the character "\u2013". Has anyone experienced this?

asked Jan 21 '26 16:01 by Carlos Eduardo Bilar Rodrigues

1 Answer

You can use the regex `[^\x00-\x7F]+` with the `regexp_replace` function to remove all non-ASCII characters (emojis, bullets, en dashes, and so on) from the column, then remove the empty double quotes that can remain:

import pyspark.sql.functions as F

df = spark.createDataFrame([('i want to remove 😃 and codes "\u2022"',)], ["value"])

df = df.withColumn(
    "value_2",
    F.regexp_replace(F.regexp_replace("value", "[^\x00-\x7F]+", ""), '""', '')
)

df.show(truncate=False)

#+---------------------------------+----------------------------+
#|value                            |value_2                     |
#+---------------------------------+----------------------------+
#|i want to remove 😃 and codes "•"|i want to remove  and codes |
#+---------------------------------+----------------------------+
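As a quick sanity check outside Spark, the same two-step cleanup can be sketched with Python's standard `re` module (the function name `strip_non_ascii` is made up for illustration; the regex and the `""` cleanup match the Spark expression above):

```python
import re

def strip_non_ascii(text: str) -> str:
    # Step 1: remove every non-ASCII character (emojis, "\u2022", "\u2013", ...).
    cleaned = re.sub(r"[^\x00-\x7F]+", "", text)
    # Step 2: drop any empty double-quote pairs left behind.
    return cleaned.replace('""', "")

print(strip_non_ascii('i want to remove \U0001F603 and codes "\u2022"'))
# -> i want to remove  and codes 
```

Note that this also removes accented letters like "é"; if you need to keep those, normalize the text first or widen the character class instead of using a plain ASCII whitelist.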
answered Jan 24 '26 05:01 by blackbishop