 

Saving empty DataFrame with known schema (Spark 2.2.1)

Is it possible to save an empty DataFrame with a known schema such that the schema is written to the file, even though it has 0 records?

import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.StructType

def example(spark: SparkSession, path: String, schema: StructType): Unit = {
  val dataframe = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  val dataframeWriter = dataframe.write.mode(SaveMode.Overwrite).format("parquet")
  dataframeWriter.save(path)

  spark.read.load(path) // ERROR!! No files to read, so schema unknown
}
asked Apr 13 '18 by Erik


People also ask

How do you create an empty DataFrame with a specified schema?

To create an empty PySpark DataFrame manually with a schema (column names and data types), first build the schema using StructType and StructField. Then create an empty RDD and pass it, together with the schema, to createDataFrame() on the SparkSession.
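A minimal Scala sketch of those steps (the two columns here are invented for illustration):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("empty-df-example").getOrCreate()

// Hypothetical two-column schema; the names and types are ours, not from the question.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// An empty RDD[Row] plus the schema yields a 0-row DataFrame with known columns.
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
emptyDf.printSchema() // prints both columns even though count() == 0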

How do I enforce a schema in Spark DataFrame?

We can create a DataFrame programmatically in three steps:

  1. Create an RDD of Rows from the original RDD.
  2. Create a schema, represented by a StructType, that matches the structure of the Rows in the RDD from step 1.
  3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.
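A sketch of those three steps in Scala, using SparkSession (the Spark 2.x entry point that wraps SQLContext); the sample data is invented:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("apply-schema-example").getOrCreate()

// Step 1: an RDD of Rows derived from an original RDD (the sample strings are made up).
val original = spark.sparkContext.parallelize(Seq("1,Alice", "2,Bob"))
val rowRdd = original.map(_.split(",")).map(fields => Row(fields(0), fields(1)))

// Step 2: a StructType matching the structure of those Rows.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// Step 3: apply the schema to the RDD of Rows.
val df = spark.createDataFrame(rowRdd, schema)
df.show()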


2 Answers

This is the answer I received from Databricks Support:

This is actually a known issue in Spark. A fix has already been made in the open-source JIRA: https://issues.apache.org/jira/browse/SPARK-23271. For more details on how this behavior changes from 2.4 onward, see this documentation change: https://github.com/apache/spark/pull/20525/files#diff-d8aa7a37d17a1227cba38c99f9f22511R1808. The behavior will change starting with Spark 2.4. Until then, you need to use one of the following approaches:

  1. Save a dataframe with at least one record, so that its schema is preserved
  2. Save the schema in a JSON file and use it later (a sketch of this follows the list)
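A minimal sketch of the second workaround, assuming we choose where the schema file lives; it round-trips the schema through StructType.json and DataType.fromJson, both part of Spark's public API:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.types.{DataType, StructType}

// Persist the schema as JSON next to the (possibly empty) data; the path is an assumption.
def saveSchema(schema: StructType, path: String): Unit =
  Files.write(Paths.get(path), schema.json.getBytes(StandardCharsets.UTF_8))

// Restore it later and hand it to the reader explicitly.
def loadSchema(path: String): StructType = {
  val json = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
  DataType.fromJson(json).asInstanceOf[StructType]
}

With the schema restored, spark.read.schema(loadSchema("/tmp/schema.json")).parquet(path) should return an empty DataFrame with the right columns instead of failing on schema inference.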
answered Oct 13 '22 by Erik

I ran into a similar problem with Spark 2.1.0 and solved it by calling repartition before writing.

// repartition(1) guarantees at least one partition, so one parquet file (carrying the schema in its footer) is written even when the DataFrame is empty
df.repartition(1).write.parquet("my/path")
answered Oct 13 '22 by scauglog