response = "mi_or_chd_5" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") # Success print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
But then:
outcome2 = sqlc.read.parquet(response) # fail
fails with:
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
in
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?
Using Spark 2.1.1. Also fails in 2.2.0.
Found this bug report, but it was fixed in 2.0.1 and 2.1.0.
UPDATE: This works when connected with master="local", and fails when connected to master="mysparkcluster".
Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
Overall, Parquet's columnar storage, together with its embedded schema and typed data, makes it well suited to analytical workloads.
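To make that schema preservation concrete, here is a minimal round-trip sketch, assuming the same sqlc handle as in the question (the sample rows and the /tmp/roundtrip_demo path are hypothetical): the schema travels in the Parquet file footer, so nothing has to be specified on read.

df = sqlc.createDataFrame(
    [(1, 0), (2, 1)],        # hypothetical sample rows
    ["eid", "response"])     # column names
df.write.parquet("/tmp/roundtrip_demo", mode="overwrite")

df2 = sqlc.read.parquet("/tmp/roundtrip_demo")
df2.printSchema()            # same names and types as df, recovered from the footer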
Inferring the Schema Using Reflection

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table: the names of the case class arguments are read using reflection and become the names of the columns.
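The passage above describes the Scala interface; as a rough PySpark analogue (assuming the question's sqlc plus a SparkContext named sc), reflection works over Row objects instead of case classes:

from pyspark.sql import Row

rdd = sc.parallelize([Row(eid=1, response=0), Row(eid=2, response=1)])
df = sqlc.createDataFrame(rdd)   # column names and types inferred from the Row fields
df.printSchema()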
This error usually occurs when you try to read an empty directory as Parquet, so your outcome DataFrame is probably empty. You can check whether the DataFrame is empty with outcome.rdd.isEmpty() before writing it, as in the sketch below.
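A defensive sketch of that check, reusing the question's outcome and response names (the guard itself is an assumption, not something the original code does): only write, and later read back, when the DataFrame actually has rows, since an empty write can leave a directory with no part files for sqlc.read.parquet to infer a schema from.

if outcome.rdd.isEmpty():
    print("outcome is empty; skipping the parquet write")
else:
    outcome.write.parquet(response, mode="overwrite")
    outcome2 = sqlc.read.parquet(response)   # safe: the directory now contains typed part files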