
Change schema of existing dataframe

I want to change the schema of an existing DataFrame, but I get an error when I try. Is it possible to change the schema of an existing DataFrame?

val customSchema=StructType(
      Array(
        StructField("data_typ", StringType, nullable=false),
        StructField("data_typ", IntegerType, nullable=false),
        StructField("proc_date", IntegerType, nullable=false),
        StructField("cyc_dt", DateType, nullable=false)
      ))

val readDF=
+------------+--------------------+-----------+--------------------+
|DatatypeCode|         Description|monthColNam|     timeStampColNam|
+------------+--------------------+-----------+--------------------+
|       03099|Volumetric/Expand...|     201867|2018-05-31 18:25:...|
|       03307|  Elapsed Day Factor|     201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+

val rows= readDF.rdd
val readDF1 = sparkSession.createDataFrame(rows,customSchema)

Expected result:

val newdf=
    +------------+--------------------+-----------+--------------------+
    |data_typ_cd |       data_typ_desc|proc_dt    |     cyc_dt         |
    +------------+--------------------+-----------+--------------------+
    |       03099|Volumetric/Expand...|     201867|2018-05-31 18:25:...|
    |       03307|  Elapsed Day Factor|     201867|2018-05-31 18:25:...|
    +------------+--------------------+-----------+--------------------+

Any help will be appreciated.

user9318576 asked May 31 '18 13:05




2 Answers

You can change a column's data type like this.

I created a DataFrame similar to yours:

import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.types._

var df = Seq(("03099","Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),("03307","Elapsed Day Factor", "201867", "2018-05-31 18:25:00"))
  .toDF("DatatypeCode","data_typ", "proc_date", "cyc_dt")

df.printSchema()
df.show()

This gives me the following output:

root
 |-- DatatypeCode: string (nullable = true)
 |-- data_typ: string (nullable = true)
 |-- proc_date: string (nullable = true)
 |-- cyc_dt: string (nullable = true)

+------------+--------------------+---------+-------------------+
|DatatypeCode|            data_typ|proc_date|             cyc_dt|
+------------+--------------------+---------+-------------------+
|       03099|Volumetric/Expand...|   201867|2018-05-31 18:25:00|
|       03307|  Elapsed Day Factor|   201867|2018-05-31 18:25:00|
+------------+--------------------+---------+-------------------+

As the schema above shows, all the columns are of type String. To change the column proc_date to Integer type and cyc_dt to Date type, I do the following:

df = df.withColumnRenamed("DatatypeCode", "data_type_code")

df = df.withColumn("proc_date_new", df("proc_date").cast(IntegerType)).drop("proc_date")

df = df.withColumn("cyc_dt_new", df("cyc_dt").cast(DateType)).drop("cyc_dt")

Then check the schema of this DataFrame:

df.printSchema()

This gives the following output, with the new column names:

root
 |-- data_type_code: string (nullable = true)
 |-- data_typ: string (nullable = true)
 |-- proc_date_new: integer (nullable = true)
 |-- cyc_dt_new: date (nullable = true)
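
The same idea can also be written as a single select that renames and casts in one pass, which avoids the temporary `_new` columns and keeps the column order. This is only a sketch: the local SparkSession is an assumption, the input column names are the ones shown in the question, the target names come from the question's expected result, and the sample values are copied from the DataFrame built above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DateType, IntegerType}

// Assumption: a local session for experimenting; reuse your own session in a real job.
val spark = SparkSession.builder().master("local[*]").appName("rename-and-cast").getOrCreate()
import spark.implicits._

// Stand-in for the question's readDF, with every column read as a string.
val readDF = Seq(
  ("03099", "Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),
  ("03307", "Elapsed Day Factor", "201867", "2018-05-31 18:25:00")
).toDF("DatatypeCode", "Description", "monthColNam", "timeStampColNam")

// Rename with as(...) and convert with cast(...) in one select.
val newDF = readDF.select(
  col("DatatypeCode").as("data_typ_cd"),                // stays a string, keeps the leading zero
  col("Description").as("data_typ_desc"),
  col("monthColNam").cast(IntegerType).as("proc_dt"),   // string -> integer
  col("timeStampColNam").cast(DateType).as("cyc_dt")    // string -> date (time part dropped)
)

newDF.printSchema()
newDF.show(false)
```

Note that columns produced by cast are reported as nullable = true, because a value that cannot be converted becomes null rather than raising an error.
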
Prasad Khode answered Oct 24 '22 22:10


You cannot change the schema like this. The schema object passed to createDataFrame has to match the data, not the other way around:

  • To parse timestamp data, use the corresponding functions; see, for example, "Better way to convert a string field into timestamp in Spark".
  • To change other types, use the cast method; see, for example, "How to change a DataFrame column from String type to Double type in PySpark".
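
To illustrate the "schema has to match the data" point, here is a minimal sketch in which the values are converted first and the schema is applied only afterwards. It rests on several assumptions: a local SparkSession, the input column names from the question, a target schema that uses the names from the question's expected result, and timestamp strings in the form `yyyy-MM-dd HH:mm:ss`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.types._

// Assumption: a local session for experimenting.
val spark = SparkSession.builder().master("local[*]").appName("schema-must-match-data").getOrCreate()
import spark.implicits._

// Stand-in for the question's readDF: all four columns are strings.
val readDF = Seq(
  ("03099", "Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),
  ("03307", "Elapsed Day Factor", "201867", "2018-05-31 18:25:00")
).toDF("DatatypeCode", "Description", "monthColNam", "timeStampColNam")

// Target schema (field names taken from the question's expected result).
val targetSchema = StructType(Array(
  StructField("data_typ_cd", StringType, nullable = false),
  StructField("data_typ_desc", StringType, nullable = false),
  StructField("proc_dt", IntegerType, nullable = false),
  StructField("cyc_dt", DateType, nullable = false)
))

// Step 1: convert the values so each row holds String, String, Int, java.sql.Date.
val converted = readDF.select(
  col("DatatypeCode"),
  col("Description"),
  col("monthColNam").cast(IntegerType),
  to_date(col("timeStampColNam"), "yyyy-MM-dd HH:mm:ss")
)

// Step 2: the rows now match the schema, so createDataFrame can apply it.
// nullable = false is only safe if no conversion above produced a null.
val typedDF = spark.createDataFrame(converted.rdd, targetSchema)

typedDF.printSchema()
typedDF.show(false)
```

Passing `readDF.rdd` straight to createDataFrame with this schema, as in the question, skips step 1, so the rows still hold strings where the schema promises an integer and a date.
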
user9876218 answered Oct 24 '22 20:10