 

How to convert DataFrame columns from string to float/double in PySpark 1.6?

In the PySpark 1.6 DataFrame API, there is no obvious built-in function to convert a column from string to float/double.

Assume we have an RDD of ('house_name', 'price') pairs, with both values as strings, and we would like to convert price from string to float. With plain RDDs, we can apply map with Python's float function to achieve this:

New_RDD = RawDataRDD.map(lambda (house_name, price): (house_name, float(price)))  # this works (Python 2)
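As an aside, tuple-unpacking lambdas like the one above were removed in Python 3, so on newer interpreters you would index into the record instead. A minimal sketch of the same mapping logic, using a plain list and names of my own choosing in place of the RDD (the record layout is the ('house_name', 'price') pair assumed above):

```python
# Stand-in data with the same ('house_name', 'price') shape as the RDD above.
raw_records = [("Maple Cottage", "250000.0"), ("Oak House", "315500.50")]

# Python 3 lambdas cannot unpack tuples, so index the record instead.
converted = list(map(lambda rec: (rec[0], float(rec[1])), raw_records))

print(converted)  # [('Maple Cottage', 250000.0), ('Oak House', 315500.5)]
```

With an actual RDD, `RawDataRDD.map(lambda rec: (rec[0], float(rec[1])))` is the direct Python 3 equivalent.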

With a PySpark 1.6 DataFrame, the equivalent does not work:

New_DF = rawdataDF.select('house name', float('price'))  # does not work

Until a built-in PySpark function is available, how can I achieve this conversion with a UDF? I developed the following conversion UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def string_to_float(x):
    return float(x)

udf_string_to_float = udf(string_to_float, FloatType())
rawdata.withColumn("price", udf_string_to_float("price"))
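One caveat with a plain float() UDF: it raises ValueError on malformed strings and TypeError on None, which would fail the Spark tasks that hit such rows. A sketch of a safer conversion function (the fall-back-to-None behavior is my assumption, not part of the question) that could be wrapped in the same udf call:

```python
def safe_string_to_float(x):
    """Convert a string to float, returning None for bad or missing values.

    Returning None lets Spark store a proper null instead of failing the task.
    """
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

# Plain-Python behavior shown here; wrapping it with
# udf(safe_string_to_float, FloatType()) applies the same logic column-wise.
print(safe_string_to_float("3.14"))  # 3.14
print(safe_string_to_float("N/A"))   # None
print(safe_string_to_float(None))    # None
```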

Is there a better and much simpler way to achieve the same?

asked Feb 28 '16 by Sohel Khan

1 Answer

According to the documentation, you can use the cast method on a column like this:

from pyspark.sql.types import DoubleType

rawdata.withColumn("price", rawdata["price"].cast(DoubleType()))
answered Oct 06 '22 by Alex