How to change a dataframe column from String type to Double type in PySpark?

Tags:

python

dataframe

apache-spark

apache-spark-sql

pyspark

There is no need for a UDF here. Column already provides a cast method that accepts a DataType instance:

from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short type string:

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

where the canonical string names (other variations may be supported as well) correspond to each type's simpleString value. So for atomic types:

from pyspark.sql import types 

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType',
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType',
          'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")

BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp

and, for example, for complex types:

types.ArrayType(types.IntegerType()).simpleString()

'array<int>'

types.MapType(types.StringType(), types.IntegerType()).simpleString()

'map<string,int>'

To keep the column name and avoid adding an extra column, use the same name as the input column:

from pyspark.sql.types import DoubleType
changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))

The answers already given are enough to solve the problem, but I want to share another way. It may have been introduced in a newer version of Spark (I am not sure about that), which would explain why the earlier answers do not mention it.

We can refer to the column in a Spark statement with the col("column_name") function:

from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))

PySpark version:

df = <source data>
df.printSchema()

from pyspark.sql.types import IntegerType

# Change column type
df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
df_new.printSchema()
df_new.select("myColumn").show()

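The solution was simple, using a UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Convert each value with Python's float(); this fails on values float() cannot parse.
toDoublefunc = udf(lambda x: float(x), DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))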
