
How to pass schema to create a new Dataframe from existing Dataframe?

To apply a schema when reading a JSON file, we do this:

from pyspark.sql.types import (StructField, StringType, StructType, IntegerType)
data_schema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]
final_struc = StructType(fields = data_schema)
df = spark.read.json('people.json', schema=final_struc)

The above code works as expected. However, my data is now in a table, which I query with:

df = sqlContext.sql("SELECT * FROM people_json")

But if I try to pass a new schema to it with the following command, it does not work:

df2 = spark.sql("SELECT * FROM people_json", schema=final_struc)

It gives the following error:

sql() got an unexpected keyword argument 'schema'

NOTE: I am using Databricks Community Edition.

  • What am I missing?
  • How do I pass the new schema if I have data in the table instead of some JSON file?
BlackBeard asked Feb 12 '18 04:02



1 Answer

You cannot apply a new schema to an already created DataFrame. However, you can change the data type of an individual column by casting it, as below:

df = df.withColumn("column_name", df["column_name"].cast("new_datatype"))
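
Since the data here comes from a SQL query, the same casting idea can also be written into the query itself. Below is a minimal sketch of that, not a definitive implementation: the helper name `build_cast_query` and the (name, type) pair format are hypothetical and only serve the illustration, while `people_json`, `age` and `name` come from the question. It only builds the query string, so it runs without a Spark session.

```python
def build_cast_query(table, columns):
    """Build a SELECT statement that casts each column to the requested type.

    `columns` is a list of (column_name, spark_sql_type) pairs.
    This helper is hypothetical; it is not part of the PySpark API.
    """
    select_list = ", ".join(
        "CAST(`{0}` AS {1}) AS `{0}`".format(name, sql_type)
        for name, sql_type in columns
    )
    return "SELECT {0} FROM {1}".format(select_list, table)


# Column names and types taken from the schema in the question.
query = build_cast_query("people_json", [("age", "INT"), ("name", "STRING")])
print(query)

# With a live SparkSession (assumed to be available as `spark`):
# df2 = spark.sql(query)
```

Passing the resulting string to spark.sql() should give a DataFrame whose columns have the requested types, without the unsupported schema argument.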

If you need to apply a whole new schema, you need to convert the DataFrame to an RDD and create a new DataFrame from it, as below:

df = sqlContext.sql("SELECT * FROM people_json")
new_df = spark.createDataFrame(df.rdd, schema=final_struc)

Hope this helps!

koiralo answered Sep 22 '22 06:09