
PySpark Dataframe : comma to dot

I have imported data where float numbers use a comma as the decimal separator, and I am wondering how I can 'convert' the commas into dots. I am using a PySpark dataframe, so I tried this:

commaToDot = udf(lambda x : str(x).replace(',', '.'), FloatType())

myData.withColumn('area',commaToDot(myData.area))

And it definitely does not work. Can we replace it directly in the Spark dataframe, or do we need to switch to a numpy type or something else?

Thanks!

asked May 17 '17 by fjcf1


2 Answers

Another way to do it (without using UDFs) is:

myData = myData.withColumn('area', regexp_replace('area', ',', '.').cast('float'))
answered Sep 28 '22 by Mara


I think you are missing

from pyspark.sql.types import FloatType

As Pushkr suggested, a udf with replace will give you back a string column if you don't convert the result to float:

from pyspark import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ReadCSV")
sc = SparkContext(conf=conf)
sqlctx = SQLContext(sc)

# Read a semicolon-delimited CSV; columns default to _c0, _c1, ...
df = sqlctx.read.option("delimiter", ";").load("test.csv", format="csv")
df.show()

# Replace the comma with a dot, then convert to float so the
# returned column actually matches the declared FloatType.
commaToDot = udf(lambda x: float(str(x).replace(',', '.')), FloatType())
df2 = df.withColumn('area', commaToDot(df._c0))
df2.printSchema()
df2.show()

I used a single-column file; tested on spark 2.11 / Python 3.6.

answered Sep 28 '22 by zlidime