
How to create an empty DataFrame in Spark

I have a set of Avro-based Hive tables that I need to read data from. Because Spark SQL uses Hive SerDes to read the data from HDFS, it is much slower than reading HDFS directly, so I use the Databricks spark-avro JAR to read the Avro files from the underlying HDFS directory.

Everything works fine except when the table is empty. I managed to get the schema from the Hive table's .avsc file using the following commands, but the read then fails with the error "No Avro files found":

import org.apache.avro.Schema
import org.apache.hadoop.fs.{FileSystem, Path}

// Parse the Hive table's Avro schema (.avsc) from HDFS
val schemaFile = FileSystem.get(sc.hadoopConfiguration).open(new Path("hdfs://myfile.avsc"))
val schema = new Schema.Parser().parse(schemaFile)

spark.read.format("com.databricks.spark.avro").option("avroSchema", schema.toString).load("/tmp/myoutput.avro").show()

Workaround:

I placed an empty file in that directory, and the same code then works fine.

Is there any other way to achieve this, such as a configuration setting?

Vinay Kumar asked Dec 03 '22 20:12



2 Answers

You don't need to use emptyRDD. Here is what worked for me with PySpark 2.4:

empty_df = spark.createDataFrame([], schema)  # spark is the SparkSession

If you already have a schema from another DataFrame, you can reuse it:

schema = some_other_df.schema

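If you don't, create the schema of the empty DataFrame manually, for example:

from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType

# Arguments of StructField: column name, data type, nullable
schema = StructType([
    StructField("col_1", StringType(), True),
    StructField("col_2", DateType(), True),
    StructField("col_3", StringType(), True),
    StructField("col_4", IntegerType(), False),
])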
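
If the schema lives in an .avsc file, as in the question, it may be possible to derive the schema from that file instead of typing it by hand. The following is only a rough sketch under several assumptions: the Avro schema is a flat record of primitive types (optionally wrapped in a ["null", type] union), every resulting column is left nullable, and the type mapping table is my own simplification. The helper names avsc_to_ddl and empty_df_from_avsc, as well as the example schema, are hypothetical. It relies on createDataFrame also accepting a schema given as a DDL-style string.

```python
import json

# Assumed mapping from Avro primitive types to Spark SQL DDL type names.
AVRO_TO_SPARK_DDL = {
    "string": "STRING",
    "int": "INT",
    "long": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "bytes": "BINARY",
}


def avsc_to_ddl(avsc_text):
    """Turn a flat Avro record schema (JSON text) into a DDL schema string."""
    record = json.loads(avsc_text)
    columns = []
    for field in record["fields"]:
        avro_type = field["type"]
        # A nullable field is usually written as a union such as ["null", "string"].
        if isinstance(avro_type, list):
            non_null = [t for t in avro_type if t != "null"]
            if len(non_null) != 1:
                raise ValueError("unsupported union in field %r" % field["name"])
            avro_type = non_null[0]
        # Nested records, arrays, maps, etc. are deliberately not handled here.
        if not isinstance(avro_type, str) or avro_type not in AVRO_TO_SPARK_DDL:
            raise ValueError("unsupported Avro type in field %r" % field["name"])
        columns.append("%s %s" % (field["name"], AVRO_TO_SPARK_DDL[avro_type]))
    return ", ".join(columns)


def empty_df_from_avsc(spark, avsc_text):
    """Create an empty DataFrame; `spark` is an existing SparkSession."""
    return spark.createDataFrame([], avsc_to_ddl(avsc_text))


# Hypothetical schema, used only to illustrate the conversion.
example_avsc = """
{"type": "record", "name": "MyRecord", "fields": [
  {"name": "col_1", "type": ["null", "string"]},
  {"name": "col_4", "type": "int"}
]}
"""
print(avsc_to_ddl(example_avsc))
```

With a running session, empty_df_from_avsc(spark, example_avsc) should then give an empty DataFrame with those two columns. For nested or logical Avro types, a hand-written StructType as above is probably the safer route.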

I hope this helps.

luvrock answered Dec 20 '22 00:12



Similar to EmiCareOfCell44's answer, but a little more elegant and more "empty":

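import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// A schema with no columns, applied to an RDD with no rows
val emptySchema = StructType(Seq())
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], emptySchema)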
Y.G. answered Dec 19 '22 23:12
