How to create an empty DataFrame in Spark

Question

I have a set of Avro-based Hive tables and I need to read data from them. Because Spark SQL uses the Hive SerDes to read the data from HDFS, it is much slower than reading the files from HDFS directly. So I have used the Databricks spark-avro JAR to read the Avro files from the underlying HDFS directory.

Everything works fine except when the table is empty. I can get the schema from the Hive table's .avsc file with the code below, but the read then fails with the error "No Avro files found":

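import org.apache.avro.Schema
import org.apache.hadoop.fs.{FileSystem, Path}

val schemaFile = FileSystem.get(sc.hadoopConfiguration).open(new Path("hdfs://myfile.avsc"))
val schema = new Schema.Parser().parse(schemaFile)

// This is the call that fails when the directory contains no Avro files
spark.read.format("com.databricks.spark.avro").option("avroSchema", schema.toString).load("/tmp/myoutput.avro").show()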

Workaround:

If I place an empty file in that directory, the same code works fine.

Is there any other way to achieve this, such as a configuration setting?

luvrock · Accepted Answer

You don't need to use emptyRDD. Here is what worked for me with PySpark 2.4:

empty_df = spark.createDataFrame([], schema) # spark is the Spark Session

If you already have a schema from another dataframe, you can just do this:

schema = some_other_df.schema

If you don't, then manually create the schema of the empty dataframe, for example:

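from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType

schema = StructType([
    StructField("col_1", StringType(), True),
    StructField("col_2", DateType(), True),
    StructField("col_3", StringType(), True),
    StructField("col_4", IntegerType(), False),  # third argument is nullable
])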

I hope this helps.
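Since the question is tagged scala, here is a rough Scala counterpart of the PySpark snippet above. It is only a sketch: the local SparkSession and the app name are assumptions for illustration (in spark-shell, `spark` already exists), and the columns are copied from the example schema above. It uses the `createDataFrame(java.util.List[Row], StructType)` overload, so no RDD is involved.

```scala
import java.util.Collections

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DateType, IntegerType, StringType, StructField, StructType}

// Assumption: a local session for illustration only; in spark-shell `spark` is predefined.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("empty-df-sketch") // hypothetical app name
  .getOrCreate()

// Same columns as the PySpark schema above.
val schema = StructType(Seq(
  StructField("col_1", StringType, nullable = true),
  StructField("col_2", DateType, nullable = true),
  StructField("col_3", StringType, nullable = true),
  StructField("col_4", IntegerType, nullable = false)
))

// An empty Java list of rows plus an explicit schema yields a DataFrame
// that has the four columns but no rows.
val emptyDf = spark.createDataFrame(Collections.emptyList[Row](), schema)

emptyDf.printSchema()
println(emptyDf.count()) // prints 0
```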

Y.G. · Answer

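Similar to EmiCareOfCell44's answer, just a little bit more elegant and more "empty":

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// A schema with no fields gives a DataFrame with no columns and no rows
val emptySchema = StructType(Seq())
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], emptySchema)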
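For the Avro case in the question, one possible way to combine these ideas is to fall back to an empty DataFrame built from the .avsc schema when the directory holds no Avro files. The following is a sketch under several assumptions, not a tested recipe: `readAvroOrEmpty` is a hypothetical helper name, the data directory is assumed to be flat with data files ending in `.avro`, and the schema conversion relies on `SchemaConverters.toSqlType` from the Databricks spark-avro package (with the Avro support bundled in Spark 2.4+, the corresponding object lives in `org.apache.spark.sql.avro`).

```scala
import com.databricks.spark.avro.SchemaConverters
import org.apache.avro.Schema
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: not part of Spark or spark-avro.
def readAvroOrEmpty(spark: SparkSession, avscPath: String, dataDir: String): DataFrame = {
  val conf = spark.sparkContext.hadoopConfiguration

  // Parse the Avro schema from the .avsc file and close the stream afterwards.
  val schemaPath = new Path(avscPath)
  val in = schemaPath.getFileSystem(conf).open(schemaPath)
  val avroSchema = try new Schema.Parser().parse(in) finally in.close()

  // Assumption: a flat directory whose data files end in ".avro".
  val dir = new Path(dataDir)
  val fs = dir.getFileSystem(conf)
  val hasAvroFiles = fs.exists(dir) &&
    fs.listStatus(dir).exists(s => s.isFile && s.getPath.getName.endsWith(".avro"))

  if (hasAvroFiles) {
    // Normal path: same read as in the question.
    spark.read.format("com.databricks.spark.avro")
      .option("avroSchema", avroSchema.toString)
      .load(dataDir)
  } else {
    // Fallback: convert the Avro record schema to a StructType
    // and build a DataFrame with those columns and no rows.
    val sqlSchema = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
    spark.createDataFrame(spark.sparkContext.emptyRDD[Row], sqlSchema)
  }
}
```

With this helper, a call such as `readAvroOrEmpty(spark, "hdfs://myfile.avsc", "/tmp/myoutput.avro")` would return a DataFrame either way, so downstream code does not need a special case for empty tables.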

Tags: scala, apache-spark, apache-spark-sql, avro, spark-avro

Vinay Kumar
