Remove special character from a column in dataframe

Question

I am trying to remove a special character (å) from a column in a dataframe.

My data looks like:

ClientID,PatientID 
AR0001å,DH_HL704221157198295_91
AR00022,DH_HL704221157198295_92

My original data is approximately 8 TB in size, and I need to remove this special character from all of it.

Code to load data:

reader.option("header", true)
  .option("sep", ",")
  .option("inferSchema", false)
  .option("charset", "ISO-8859-1")
  .schema(schema)
  .csv(path)

After loading the data into a dataframe, df.show() displays:

+--------+--------------------+
|ClientID|           PatientID|
+--------+--------------------+
|AR0001Ã¥|DH_HL704221157198...|
|AR00022 |DH_HL704221157198...|
+--------+--------------------+

Code I used to try to replace this character:

df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "\å", ""));

But this didn't work. If I change the charset to "UTF-8" when loading the data into the dataframe, it works.

I am not able to find a solution with the current charset (ISO-8859-1).

Aditya Gupta · Accepted Answer

The command below removes every character that is not an ASCII letter (upper or lower case) or a digit:

// withColumn returns a new dataframe, so keep the result
df = df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "[^a-zA-Z0-9]", ""));
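
To see what this pattern does to the sample values without starting Spark, here is a minimal plain-Java sketch. It assumes that regexp_replace follows java.util.regex syntax; the class and method names are hypothetical, and the garbled value "AR0001Ã¥" is written with Unicode escapes so the source file stays ASCII-only.

```java
import java.util.regex.Pattern;

class KeepAlphanumerics {
    // Same pattern as in the answer: anything that is not an ASCII letter or digit.
    private static final Pattern NOT_ALNUM = Pattern.compile("[^a-zA-Z0-9]");

    // Hypothetical helper: removes every character matched by the pattern.
    public static String clean(String value) {
        return NOT_ALNUM.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        // ClientID as displayed by df.show(): "AR0001" followed by U+00C3 U+00A5.
        System.out.println(clean("AR0001\u00C3\u00A5"));      // AR0001
        // The underscore is not in the allowed set, so it is removed as well.
        System.out.println(clean("DH_HL704221157198295_91")); // DHHL70422115719829591
    }
}
```

Note that the same pattern would also strip the underscores from PatientID, so it is probably only suitable for columns such as ClientID that contain nothing but letters and digits.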

Shaido · Answer

Some things to note:

Assign the result of withColumn to a new variable and use that afterwards; the original dataframe is not modified.
You do not need to escape "å" with a backslash.
The colName in the command should be ClientID or PatientID.

If all of these are already in place, I would suggest matching on the characters you want to keep instead of matching on "å". For example, for the ClientID column:

df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "[^A-Z0-9_]", ""));

Another approach is to reinterpret the UTF-8 bytes of "å" as ISO-8859-1, which yields the string the dataframe actually contains ("Ã¥"), and then replace that string.

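// requires: import java.nio.charset.StandardCharsets; (avoids the checked UnsupportedEncodingException)
String escapeChar = new String("å".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);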
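
The conversion can be checked in isolation. The following is a small sketch with hypothetical class and method names, using Unicode escapes instead of literal non-ASCII characters so the result does not depend on the source file encoding. It shows that the UTF-8 bytes of "å" (0xC3 0xA5) decode to the two characters "Ã¥" under ISO-8859-1, which is the sequence seen in the df.show() output.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Re-decode the UTF-8 bytes of a string as ISO-8859-1.
    public static String asLatin1Mojibake(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E5 is the letter a-with-ring; its UTF-8 encoding is the bytes C3 A5.
        String target = asLatin1Mojibake("\u00E5");
        System.out.println(target.length());                       // 2
        System.out.println(target.equals("\u00C3\u00A5"));         // true

        // A ClientID value as it would look after the ISO-8859-1 read.
        String loaded = "AR0001" + target;
        System.out.println(loaded.replace(target, ""));            // AR0001
    }
}
```

In the Spark job, the resulting string would presumably be passed as the pattern argument of regexp_replace in place of "å". Neither of its two characters is a regex metacharacter, so no extra quoting should be needed here; for other characters, wrapping the string in Pattern.quote is a safer choice.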

Tags:

java

character-encoding

csv

apache-spark

apache-spark-sql

abhiadh
