
How to change a PySpark DataFrame column's data type?

I'm looking for a way to change the type of a PySpark DataFrame column,

from this df.printSchema() output:

[screenshot of the current schema]

to:

[screenshot of the desired schema]

Thanks in advance for your help.

asked Oct 19 '25 by user2763088

1 Answer

You have to replace the column with a new schema. ArrayType takes two parameters: elementType and containsNull.

from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.sql.functions import udf

x = [("a", ["b", "c", "d", "e"]), ("g", ["h", "h", "d", "e"])]
schema = StructType([StructField("key", StringType(), nullable=True),
                     StructField("values", ArrayType(StringType(), containsNull=False))])

df = spark.createDataFrame(x, schema=schema)
df.printSchema()

# An identity udf re-evaluates the column under the declared return type,
# which rewrites the array's containsNull flag in the schema.
new_schema = ArrayType(StringType(), containsNull=True)
udf_foo = udf(lambda x: x, new_schema)
df.withColumn("values", udf_foo("values")).printSchema()



root
 |-- key: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: string (containsNull = false)

root
 |-- key: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: string (containsNull = true)
answered Oct 21 '25 by pauli