
Efficient way to transform several columns to string in PySpark

It is well documented on SO (link 1, link 2, link 3, ...) how to cast a single column to string type in PySpark:

from pyspark.sql.types import StringType

spark_df = spark_df.withColumn('name_of_column', spark_df['name_of_column'].cast(StringType()))

However, when you have several columns that you want to transform to string type, there are several ways to achieve it:

Using for loops -- Successful approach in my code:

Trivial example:

to_str = ['age', 'weight', 'name', 'id']
for col in to_str:
  spark_df = spark_df.withColumn(col, spark_df[col].cast(StringType()))

This works, but I suspect it is not the optimal approach I am looking for.
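A common alternative to the explicit loop is folding the cast over the column list with functools.reduce; behaviorally it is the same chain of withColumn calls. Since the fold itself needs a live DataFrame, the sketch below exercises the same pattern on a plain dict stand-in (fake_df is hypothetical, not part of the question):

```python
from functools import reduce

to_str = ['age', 'weight', 'name', 'id']

# On a real DataFrame the fold would be (commented out: needs a SparkSession):
# spark_df = reduce(
#     lambda df, c: df.withColumn(c, df[c].cast('string')), to_str, spark_df)

# The same fold demonstrated on a dict stand-in for a single row:
fake_df = {'age': 1, 'weight': 2.5, 'name': 'x', 'id': 7}
casted = reduce(lambda d, c: {**d, c: str(d[c])}, to_str, fake_df)
print(casted)  # every value is now a string
```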

Using list comprehensions -- Not successful in my code:

My wrong example:

spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str))

This fails with the error message:

TypeError: 'str' object is not callable
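This TypeError is not specific to Spark; it appears whenever a loop variable rebinds a name that previously referred to a function. A minimal pure-Python reproduction (the col function here is a hypothetical stand-in, not pyspark.sql.functions.col):

```python
# A function bound to the name `col` (stand-in for pyspark's col).
def col(name):
    return name.upper()

# A loop variable with the same name rebinds `col` to a plain string...
for col in ['age', 'weight']:
    pass

# ...so calling col(...) afterwards calls the string 'weight'.
try:
    col('age')
except TypeError as err:
    print(err)  # 'str' object is not callable
```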

My question: what is the optimal way to transform several columns to string in PySpark, given a list of column names like to_str in my example?

Thanks in advance for your advice.

CLARIFICATION EDIT:

Thanks to @Rumoku and @pault feedback:

Both code lines are correct:

spark_df = spark_df.select(*(col(c).cast("string").alias(c) for c in to_str)) # My initial list comprehension expression is correct.

and

spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str]) # Initial answer proposed by @Rumoku is correct.

I was receiving the error messages from PySpark because I had previously renamed the object to_str to col. As @pault explains: col (the list holding the desired column names) shadowed the col function used in the list comprehension, which is why PySpark complained. Simply renaming the list back to to_str and restarting spark-notebook fixed everything.

NuValue asked Sep 17 '25

1 Answer

It should be:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

spark_df = spark_df.select([col(c).cast(StringType()).alias(c) for c in to_str])
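If you would rather avoid importing col and StringType, an equivalent option (a sketch, assuming Spark SQL CAST expression syntax) is selectExpr with generated expressions. Only the string building runs below; the selectExpr call itself needs a live DataFrame:

```python
to_str = ['age', 'weight', 'name', 'id']

# One SQL CAST expression per column to keep names unchanged.
exprs = [f"CAST({c} AS string) AS {c}" for c in to_str]
print(exprs[0])  # CAST(age AS string) AS age

# spark_df = spark_df.selectExpr(*exprs)  # commented out: needs a SparkSession
```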
vvg answered Sep 20 '25