Is it possible to save an empty DataFrame with a known schema such that the schema is written to the file, even though it has 0 records?
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.StructType

def example(spark: SparkSession, path: String, schema: StructType): Unit = {
  val dataframe = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  val dataframeWriter = dataframe.write.mode(SaveMode.Overwrite).format("parquet")
  dataframeWriter.save(path)
  spark.read.load(path) // ERROR: no files were written, so the schema cannot be inferred
}
To create an empty PySpark DataFrame manually with a known schema (column names and data types), first define the schema using StructType and StructField, then create an empty RDD and pass it, together with the schema, to createDataFrame() on the SparkSession.
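The Scala equivalent mirrors those steps. As a minimal sketch (the emptyDF name and the two-column schema are illustrative), an empty java.util.List of Rows works just as well as an empty RDD:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.appName("empty-df").master("local[*]").getOrCreate()

// Define the schema first: column names and data types
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// Pass an empty collection of Rows plus the schema to createDataFrame
val emptyDF = spark.createDataFrame(new java.util.ArrayList[Row](), schema)
emptyDF.printSchema() // schema is present even though there are 0 rows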
More generally, we can create a DataFrame programmatically in three steps:
1. Create an RDD of Rows from the original RDD.
2. Define the schema as a StructType matching the structure of the Rows from step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession (or the older SQLContext).
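A minimal Scala sketch of those three steps, assuming a SparkSession named spark is in scope (the sample data and column names are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Step 1: an RDD of Rows derived from an original RDD of raw strings
val rowRDD = spark.sparkContext
  .parallelize(Seq("Alice,29", "Bob,31"))
  .map(_.split(","))
  .map(fields => Row(fields(0), fields(1).trim))

// Step 2: a StructType matching the structure of those Rows
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", StringType, nullable = true)))

// Step 3: apply the schema via createDataFrame
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.printSchema()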
This is the answer I received from Databricks Support:
This is actually a known issue in Spark; a fix has already been made in the open-source JIRA: https://issues.apache.org/jira/browse/SPARK-23271. For more details on how this behavior changes in 2.4, please check this doc change: https://github.com/apache/spark/pull/20525/files#diff-d8aa7a37d17a1227cba38c99f9f22511R1808. The behavior changes starting with Spark 2.4. Until then you need to use one of the following workarounds:
- Save a DataFrame with at least one record to preserve its schema
- Save the schema in a JSON file and use it later (see the sketch below)
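A minimal sketch of the second workaround, assuming a local path for the schema file (on HDFS you would go through the Hadoop FileSystem API instead); the /tmp/schema.json path is illustrative:

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.types.{DataType, StructType}

// On write: serialize the schema to JSON next to the (possibly empty) data
Files.write(Paths.get("/tmp/schema.json"),
  dataframe.schema.json.getBytes(StandardCharsets.UTF_8))

// On read: restore the schema and hand it to the reader explicitly,
// so nothing needs to be inferred from the (possibly absent) data files
val json = new String(Files.readAllBytes(Paths.get("/tmp/schema.json")), StandardCharsets.UTF_8)
val restoredSchema = DataType.fromJson(json).asInstanceOf[StructType]
val restored = spark.read.schema(restoredSchema).parquet(path)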
I ran into a similar problem with Spark 2.1.0 and solved it by using repartition before writing.
df.repartition(1).write.parquet("my/path")
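A sketch of the full round trip under that workaround (whether an empty part file actually gets written can vary by Spark version, so treat this as something to verify on your setup):

import org.apache.spark.sql.{Row, SaveMode}

val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
empty.repartition(1).write.mode(SaveMode.Overwrite).parquet("my/path")
spark.read.parquet("my/path").printSchema() // schema is readable despite 0 rows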