Replace string in PySpark

Tags: python, replace, dataframe, pyspark

I have a DataFrame with numbers in European format, which I imported as strings: the comma is the decimal separator and the dot is the thousands separator.

from pyspark.sql.functions import regexp_replace,col
from pyspark.sql.types import FloatType
df = spark.createDataFrame([('-1.269,75',)], ['revenue'])
df.show()
+---------+
|  revenue|
+---------+
|-1.269,75|
+---------+
df.printSchema()
root
 |-- revenue: string (nullable = true)

Desired output of df.show():

+--------+
| revenue|
+--------+
|-1269.75|
+--------+
df.printSchema()
root
 |-- revenue: float (nullable = true)

I am using the function regexp_replace to first replace the dot with an empty string, then replace the comma with a dot, and finally cast the column to FloatType.

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
df = df.withColumn('revenue', regexp_replace(col('revenue'), ",", "."))
df = df.withColumn('revenue', df['revenue'].cast("float"))

But when I run only the first replacement, shown below, I get an empty string. Why? I was expecting -1269,75.

df = df.withColumn('revenue', regexp_replace(col('revenue'), ".", ""))
+-------+
|revenue|
+-------+
|       |
+-------+

450

asked Oct 31 '18 16:10

cph_sto

1 Answer

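You need to escape . to match it literally. In a regex, an unescaped . is a special character that matches almost any character, so the pattern "." matches (and removes) every character in the string, which is why you get an empty result: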

df = df.withColumn('revenue', regexp_replace(col('revenue'), "\\.", ""))
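The difference between the two patterns can be seen without Spark. The following is a minimal sketch using Python's built-in re module on the sample value from the question. Spark's regexp_replace uses Java regular expressions rather than Python's, but for these two patterns the behavior is the same. The variable names are only illustrative.

```python
import re

raw = "-1.269,75"

# Unescaped "." is a wildcard, so every character is matched and removed.
wildcard_result = re.sub(".", "", raw)

# Escaped "\." matches only a literal dot (the thousands separator here).
literal_result = re.sub(r"\.", "", raw)

# Swap the decimal comma for a dot, then convert to a number.
normalized = literal_result.replace(",", ".")
value = float(normalized)

print(repr(wildcard_result))  # ''
print(literal_result)         # -1269,75
print(value)                  # -1269.75
```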

182

answered Oct 13 '22 22:10

Psidom
