I am trying to create an empty dataframe in Spark (Pyspark).
I am using a similar approach to the one discussed here, but it is not working.
This is my code
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This is the error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
    first = rdd.first()
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
    raise ValueError("RDD is empty")
ValueError: RDD is empty
You can create an empty RDD in PySpark with emptyRDD() on the SparkContext, for example spark.sparkContext.emptyRDD(). Alternatively, you can get an empty RDD with spark.sparkContext.parallelize([]).
isEmpty() returns true if and only if the RDD contains no elements at all. Note that an RDD may be empty even when it has at least one partition.
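For reference, a minimal sketch of those two calls. It assumes Spark 2.x with a SparkSession named spark, as in the snippet above; on Spark 1.x you would call the same methods on sc directly.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("empty-rdd-demo").getOrCreate()

# Two ways to get an RDD with no elements
rdd1 = spark.sparkContext.emptyRDD()
rdd2 = spark.sparkContext.parallelize([])

print(rdd1.isEmpty())  # True
print(rdd2.isEmpty())  # True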
Extending Joe Widen's answer, you can actually create the schema with no fields like so:

from pyspark.sql.types import StructType

schema = StructType([])
So when you create the DataFrame using that as your schema, you'll end up with a DataFrame[].
>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> empty
DataFrame[]
>>> empty.schema
StructType(List())
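If you need the empty DataFrame to have real columns rather than no fields at all, the same pattern works with a populated StructType. The column names and types below are just an example:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# No rows, but the schema is fully defined, so nothing needs to be inferred
empty_with_columns = sqlContext.createDataFrame(sc.emptyRDD(), schema)
empty_with_columns.printSchema()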
In Scala, if you choose to use sqlContext.emptyDataFrame and check out the schema, it will return StructType().
scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []

scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()
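On newer PySpark versions (2.x and later), where you typically have a SparkSession rather than a SQLContext, the same idea looks like this. This is a sketch assuming a session named spark:

from pyspark.sql.types import StructType

schema = StructType([])
empty = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

print(empty.schema)          # empty StructType
print(empty.rdd.isEmpty())   # True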