I am having a dataframe, with numbers in European format, which I imported as a String. Comma as decimal and vice versa -
from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import FloatType
df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
| revenue|
+---------+
|-1.269,75|
+---------+
df.printSchema()
root
|-- revenue: string (nullable = true)
Output desired: df.show()
+---------+
| revenue|
+---------+
|-1269.75|
+---------+
df.printSchema()
root
|-- revenue: float (nullable = true)
I am using function regexp_replace
to first replace dot with empty space - then replace comma with empty dot and finally cast into floatType.
df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', df['revenue'].cast("float"))
But, when I attempt replacing below, I get empty string. Why?? I was expecting -1269,75
.
df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
+-------+
|revenue|
+-------+
| |
+-------+
By using PySpark SQL function regexp_replace() you can replace a column value with a string for another string/substring. regexp_replace() uses Java regex for matching, if the regex does not match it returns an empty string, the below example replace the street name Rd value with Road string on address column.
In order to replace empty value with None/null on single DataFrame column, you can use withColumn() and when(). otherwise() function.
You need to escape .
to match it literally, as .
is a special character that matches almost any character in regex:
df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With