
PySpark Dataframe : comma to dot

I have imported data where float numbers use a comma as the decimal separator, and I am wondering how I can 'convert' the commas into dots. I am using a PySpark dataframe, so I tried this:

commaToDot = udf(lambda x : str(x).replace(',', '.'), FloatType())

myData.withColumn('area',commaToDot(myData.area))

And it definitely does not work. Can we replace it directly in the Spark dataframe, or do we need to switch to a numpy type or something else?

Thanks!

asked May 17 '17 by fjcf1


2 Answers

Another way to do it (without using UDFs) is:

myData = myData.withColumn('area', regexp_replace('area', ',', '.').cast('float'))
answered Sep 28 '22 by Mara


I think you are missing

from pyspark.sql.types import FloatType

As Pushkr suggested, a udf with replace will give you back a string column if you don't convert the result to float:

from pyspark import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("ReadCSV")
sc = SparkContext(conf=conf)
sqlctx = SQLContext(sc)

# Read a semicolon-delimited CSV; columns default to _c0, _c1, ...
df = sqlctx.read.option("delimiter", ";").load("test.csv", format="csv")
df.show()

# Replace the comma with a dot, then convert to float so the
# returned column actually matches the declared FloatType.
commaToDot = udf(lambda x: float(str(x).replace(',', '.')), FloatType())
df2 = df.withColumn('area', commaToDot(df._c0))
df2.printSchema()
df2.show()

I used a single-column file; tested on spark 2.11 / Python 3.6.

answered Sep 28 '22 by zlidime