How to cast all columns of a DataFrame to string

I have a mixed-type DataFrame that I am reading from a Hive table using the spark.sql('select a,b,c from table') command.

Some columns are int, bigint, or double, and others are string. There are 32 columns in total. Is there any way in PySpark to convert all columns of the DataFrame to string type?

asked Feb 07 '17 by user1411335

People also ask

How do I convert all columns to string in PySpark?

To convert an array column to a string, PySpark SQL provides the built-in function concat_ws(), which takes a delimiter of your choice as the first argument and an array column (type Column) as the second. To use concat_ws(), you need to import it from pyspark.sql.functions.

How do I cast columns in pandas?

Use a numpy.dtype or Python type to cast the entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type, to cast one or more of the DataFrame's columns to column-specific types.
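A minimal pandas sketch of both forms (the column names and data here are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# form 1: cast the whole frame to a single type
all_str = df.astype(str)

# form 2: cast selected columns with a {col: dtype} mapping
mixed = df.astype({"a": str})
```

After the first call every column holds strings; after the second, only column "a" does and "b" keeps its float dtype.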

How do I get a list of columns in a DataFrame?

You can get the column names from a pandas DataFrame using df.columns.values, and pass the result to Python's list() function to get them as a list; once you have them, you can print them using a print() statement.
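For example (the DataFrame contents are made up):

```python
import pandas as pd

df = pd.DataFrame({"num": [1], "letter": ["a"]})

# column names as a plain Python list
cols = list(df.columns.values)
print(cols)  # ['num', 'letter']
```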

How do I cast a column in PySpark DataFrame?

In PySpark, you can cast or change a DataFrame column's data type using the cast() function of the Column class, typically via withColumn(), selectExpr(), or a SQL expression, for example to cast from String to Int (IntegerType), String to Boolean, etc.


2 Answers

Just:

from pyspark.sql.functions import col

# load the Hive table; spark.table("table") is equivalent to
# spark.sql("select * from table")
table = spark.table("table")

# cast every column to string in a single select
table = table.select([col(c).cast("string") for c in table.columns])
answered Oct 27 '22 by user7526416

Here's a one-line solution in Scala:

df.select(df.columns.map(c => col(c).cast(StringType)) : _*)

Let's see an example here :

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val data = Seq(
   Row(1, "a"),
   Row(5, "z")
)

val schema = StructType(
  List(
    StructField("num", IntegerType, true),
    StructField("letter", StringType, true)
 )
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

df.printSchema
//root
//|-- num: integer (nullable = true)
//|-- letter: string (nullable = true)

val newDf = df.select(df.columns.map(c => col(c).cast(StringType)) : _*)

newDf.printSchema
//root
//|-- num: string (nullable = true)
//|-- letter: string (nullable = true)

I hope this helps.

answered Oct 27 '22 by mahmoud mehdi