I am trying to remove a special character (å) from a column in a dataframe.
My data looks like:
ClientID,PatientID
AR0001å,DH_HL704221157198295_91
AR00022,DH_HL704221157198295_92
My original data is approximately 8 TB in size, and I need to remove this special character from it.
Code to load data:
reader.option("header", true)
.option("sep", ",")
.option("inferSchema", false)
.option("charset", "ISO-8859-1")
.schema(schema)
.csv(path)
After loading the data into a dataframe, df.show() prints:
+--------+--------------------+
|ClientID|           PatientID|
+--------+--------------------+
|AR0001Ã¥|DH_HL704221157198...|
|AR00022 |DH_HL704221157198...|
+--------+--------------------+
Code I used to try to replace this character:
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "\å", ""));
But this didn't work. If I change the charset to "UTF-8" while loading the data, it works.
I am not able to find a solution with the current charset (ISO-8859-1).
The command below removes all special characters from the string, keeping only upper- and lower-case letters and digits:
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "[^a-zA-Z0-9]", ""));
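To see what this pattern does to the values in the question, here is a minimal plain-Java sketch that applies the same regular expression with String.replaceAll. It assumes that Spark's regexp_replace follows Java regex semantics for this simple character class; the class name KeepAlphanumerics and the method clean are hypothetical names used only for illustration.

```java
class KeepAlphanumerics {
    // Removes every character that is not an ASCII letter or digit.
    static String clean(String value) {
        return value.replaceAll("[^a-zA-Z0-9]", "");
    }

    public static void main(String[] args) {
        // "\u00C3\u00A5" is the two-character string "Ã¥" that df.show() displays
        // in the question. Unicode escapes are used so that the example does not
        // depend on the encoding of this source file.
        System.out.println(clean("AR0001\u00C3\u00A5")); // prints AR0001
        System.out.println(clean("AR00022"));            // prints AR00022
    }
}
```

Because the pattern lists the characters to keep, it does not matter whether the column holds "å" or the mis-decoded "Ã¥": both fall outside the allowed set and are removed.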
Some things to note: the "\" in the pattern must itself be escaped as "\\" in a Java string literal, and colName in the command should be ClientID or PatientID.
If you have done all of this and it still fails, I would suggest that, instead of matching on "å", you match on the characters you want to keep. For example, for the ClientID column:
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "[^A-Z0-9_]", ""));
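The choice of allowed characters matters if the same cleanup is applied to PatientID, whose values contain underscores. The plain-Java sketch below compares the two patterns from the answers on the sample values; the class and method names are hypothetical, and String.replaceAll stands in for regexp_replace on the assumption that both use Java regex semantics here.

```java
class WhitelistComparison {
    // Pattern from the first answer: keeps ASCII letters and digits only.
    static String keepAlphanumerics(String value) {
        return value.replaceAll("[^a-zA-Z0-9]", "");
    }

    // Pattern from this answer: keeps upper-case letters, digits, and underscores.
    static String keepUpperDigitsUnderscore(String value) {
        return value.replaceAll("[^A-Z0-9_]", "");
    }

    public static void main(String[] args) {
        String patientId = "DH_HL704221157198295_91";
        // The underscores are stripped by the first pattern.
        System.out.println(keepAlphanumerics(patientId));         // prints DHHL70422115719829591
        // The second pattern leaves the value unchanged.
        System.out.println(keepUpperDigitsUnderscore(patientId)); // prints DH_HL704221157198295_91
        // The mis-decoded "Ã¥" ("\u00C3\u00A5") is removed by the second pattern too.
        System.out.println(keepUpperDigitsUnderscore("AR0001\u00C3\u00A5")); // prints AR0001
    }
}
```

Note that "[^A-Z0-9_]" would also remove lower-case letters, so it only suits columns whose valid values are upper-case.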
Another approach would be to take the UTF-8 bytes of "å", decode them as ISO-8859-1, and replace the resulting string (the "Ã¥" that df.show() displays):
String escapeChar = new String("å".getBytes("UTF-8"), "ISO-8859-1");
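The following self-contained sketch shows why this conversion produces the string seen in the dataframe and how it could be used as a replacement pattern. It is an illustration under assumptions: the class MojibakeReplace and its methods are hypothetical, StandardCharsets is used instead of charset names to avoid the checked UnsupportedEncodingException, and String.replaceAll stands in for regexp_replace. In Spark, the quoted pattern would presumably be passed as the second argument of functions.regexp_replace.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class MojibakeReplace {
    // Encodes the text as UTF-8 and decodes those bytes as ISO-8859-1,
    // reproducing what a reader configured with ISO-8859-1 sees in a UTF-8 file.
    static String asLatin1Mojibake(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Removes the mis-decoded form of `unwanted` from `value`.
    // Pattern.quote makes the search string a literal, so no regex escaping is needed.
    static String strip(String value, String unwanted) {
        return value.replaceAll(Pattern.quote(asLatin1Mojibake(unwanted)), "");
    }

    public static void main(String[] args) {
        // "\u00E5" is "å"; its UTF-8 bytes are 0xC3 0xA5, which ISO-8859-1
        // decodes as the two characters "\u00C3\u00A5" ("Ã¥").
        String escapeChar = asLatin1Mojibake("\u00E5");
        System.out.println(escapeChar.equals("\u00C3\u00A5"));          // prints true
        System.out.println(strip("AR0001\u00C3\u00A5", "\u00E5"));      // prints AR0001
    }
}
```

This also explains why matching on "å" directly found nothing: with the ISO-8859-1 charset, the column never contains the single character "å", only the two characters produced from its UTF-8 bytes.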