Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replace string in PySpark

I am having a dataframe, with numbers in European format, which I imported as a String. Comma as decimal and vice versa -

from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import FloatType
df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
|  revenue|
+---------+
|-1.269,75|
+---------+
df.printSchema()
root
 |-- revenue: string (nullable = true)

Output desired: df.show()

+---------+
|  revenue|
+---------+
|-1269.75|
+---------+
df.printSchema()
root
 |-- revenue: float (nullable = true)

I am using function regexp_replace to first replace dot with empty space - then replace comma with empty dot and finally cast into floatType.

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', df['revenue'].cast("float"))

But, when I attempt replacing below, I get empty string. Why?? I was expecting -1269,75.

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
+-------+
|revenue|
+-------+
|       |
+-------+
like image 450
cph_sto Avatar asked Oct 31 '18 16:10

cph_sto


People also ask

How do you replace a string in PySpark?

By using PySpark SQL function regexp_replace() you can replace a column value with a string for another string/substring. regexp_replace() uses Java regex for matching, if the regex does not match it returns an empty string, the below example replace the street name Rd value with Road string on address column.

How do I replace a string value with a NULL in PySpark?

In order to replace empty value with None/null on single DataFrame column, you can use withColumn() and when(). otherwise() function.


1 Answers

You need to escape . to match it literally, as . is a special character that matches almost any character in regex:

df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
like image 182
Psidom Avatar answered Oct 13 '22 22:10

Psidom