How to create an empty DataFrame? Why "ValueError: RDD is empty"?

Tags:

apache-spark

pyspark

I am trying to create an empty DataFrame in Spark (PySpark).

I am using an approach similar to the one discussed here, but it is not working.

This is my code:

df = sqlContext.createDataFrame(sc.emptyRDD(), schema)

This is the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
    rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio)
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
    first = rdd.first()
  File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
    raise ValueError("RDD is empty")
ValueError: RDD is empty


asked Jan 06 '16 02:01

user3276768

1 Answer

Extending Joe Widen's answer, you can create a schema with no fields, like so:

from pyspark.sql.types import StructType

schema = StructType([])

When you create the DataFrame with that schema, you end up with an empty DataFrame, displayed as DataFrame[].

>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> empty
DataFrame[]
>>> empty.schema
StructType(List())
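As for why the original call fails: the traceback in the question shows createDataFrame going through _createFromRDD into _inferSchema, which calls rdd.first(). That suggests the schema passed in was not a StructType (for example None or a plain list of column names), so Spark tried to infer one from the first row. Below is a simplified pure-Python model of that decision, written only to illustrate the idea. It is not Spark's source, and FakeStructType, infer_schema and create_frame are hypothetical stand-ins.

```python
# Simplified model of the call path shown in the traceback.
# NOT Spark's code: every name below is a hypothetical stand-in.

class FakeStructType:
    """Stand-in for pyspark.sql.types.StructType: just a list of field names."""

    def __init__(self, fields):
        self.fields = list(fields)


def infer_schema(rows):
    # Inference has to look at the first row, so empty input cannot work.
    if not rows:
        raise ValueError("RDD is empty")
    return FakeStructType(sorted(rows[0].keys()))


def create_frame(rows, schema=None):
    # Assumption: only a full struct schema skips inference; None or a
    # bare list of column names still needs a sample row.
    if isinstance(schema, FakeStructType):
        struct = schema
    else:
        struct = infer_schema(rows)
    return {"schema": struct.fields, "rows": list(rows)}


# An explicit (even field-less) schema works on empty input.
empty = create_frame([], FakeStructType([]))
print(empty)

# Without a struct schema, inference runs and fails on empty input.
try:
    create_frame([], ["id", "name"])
except ValueError as exc:
    print("ValueError:", exc)
```

If this model matches your situation, checking that the schema variable in the question really is a StructType instance before calling createDataFrame would be the first thing to try.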

In Scala, if you use sqlContext.emptyDataFrame and inspect its schema, it returns StructType().

scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []

scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()

answered Oct 07 '22 01:10

Ton Torres
