I have a mixed-type DataFrame that I am reading from a Hive table with the command
spark.sql('select a,b,c from table')
Some columns are int, bigint or double and others are string. There are 32 columns in total. Is there any way in PySpark to convert all columns of the DataFrame to string type?
In order to convert an array to a string, PySpark SQL provides the built-in function concat_ws(), which takes a delimiter of your choice as the first argument and an array column (type Column) as the second argument. To use concat_ws(), you need to import it from pyspark.sql.functions.
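For instance, a minimal sketch (the sample DataFrame and column names here are made up for illustration, and a SparkSession named spark is assumed):
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["a", "b", "c"])], ["id", "letters"])
# concat_ws(delimiter, array_column) joins the array elements into one string
df.select("id", concat_ws(",", "letters").alias("letters_str")).show()
# +---+-----------+
# | id|letters_str|
# +---+-----------+
# |  1|      a,b,c|
# +---+-----------+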
Use a numpy.dtype or Python type to cast an entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type, to cast one or more of the DataFrame's columns to column-specific types.
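For example, a small pandas sketch (the DataFrame and column names are made up for illustration):
import pandas as pd
pdf = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
pdf_all_str = pdf.astype(str)        # cast every column to str
pdf_a_str = pdf.astype({"a": str})   # cast only column "a"
print(pdf_all_str.dtypes)            # both columns become object (string) dtype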
You can get the column names from a pandas DataFrame using df.columns.values, and pass this to the Python list() function to get them as a list; once you have the data you can print it using the print() statement.
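A quick sketch of that (the DataFrame is again invented for illustration):
import pandas as pd
pdf = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
cols = list(pdf.columns.values)
print(cols)  # ['a', 'b']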
In PySpark, you can cast or change a DataFrame column's data type using the cast() function of the Column class. In this article, I will be using withColumn(), selectExpr(), and SQL expressions to cast from String to Int (Integer Type), String to Boolean, etc., using PySpark examples.
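A minimal sketch of those three approaches (the DataFrame, column names and view name are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", "30")], ["name", "age"])  # age starts out as string
# 1) withColumn() + cast() on the Column
df.withColumn("age", col("age").cast("int")).printSchema()
# 2) selectExpr() with a SQL-style cast
df.selectExpr("name", "cast(age as int) as age").printSchema()
# 3) plain SQL expression against a temporary view
df.createOrReplaceTempView("people")
spark.sql("select name, cast(age as int) as age from people").printSchema()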
Just:
from pyspark.sql.functions import col
table = spark.sql("select a,b,c from table")
table = table.select([col(c).cast("string") for c in table.columns])  # cast every column to string
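Calling table.printSchema() afterwards should show every column as string; the Scala answer below walks through the same check.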
Here's a one-line solution in Scala:
df.select(df.columns.map(c => col(c).cast(StringType)) : _*)
Let's see an example here:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val data = Seq(
  Row(1, "a"),
  Row(5, "z")
)
val schema = StructType(
  List(
    StructField("num", IntegerType, true),
    StructField("letter", StringType, true)
  )
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)
df.printSchema
//root
//|-- num: integer (nullable = true)
//|-- letter: string (nullable = true)
val newDf = df.select(df.columns.map(c => col(c).cast(StringType)) : _*)
newDf.printSchema
//root
//|-- num: string (nullable = true)
//|-- letter: string (nullable = true)
I hope it helps.