 

Unable to infer schema when loading Parquet file

response = "mi_or_chd_5"

outcome = sqlc.sql("""select eid, {response} as response
    from outcomes where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")  # Success
print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

But then:

outcome2 = sqlc.read.parquet(response)  # fail 

fails with:

AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;' 

in

/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw) 

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Using Spark 2.1.1. Also fails in 2.2.0.

Found this bug report, but it was fixed in 2.0.1 and 2.1.0.

UPDATE: This works when connected with master="local", and fails when connected to master="mysparkcluster".

user48956 asked Jul 06 '17

People also ask

Does Parquet infer schema?

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
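For illustration, a minimal sketch of that write-and-read-back roundtrip, reusing the question's sqlc (SQLContext); the path and column names here are made up, not from the thread:

# Hypothetical example: write a DataFrame to Parquet, then read it back.
df = sqlc.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.write.parquet("/tmp/example_parquet", mode="overwrite")

# No schema needs to be specified on read; it is recovered from the
# Parquet file footers.
df2 = sqlc.read.parquet("/tmp/example_parquet")
df2.printSchema()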

Does Parquet file have schema?

Overall, Parquet's features of storing data in columnar format together with schema and typed data allow efficient use for analytical purposes.

How do you infer a schema in Spark?

Inferring the Schema Using Reflection: the Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns.
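A rough PySpark analogue of that case-class approach (a sketch, not from the thread, again assuming the question's sqlc context): field names and types are inferred by reflection over Row objects.

from pyspark.sql import Row

# Hypothetical rows; names and types are picked up from the Row fields.
rows = [Row(eid=1, response=0), Row(eid=2, response=1)]
df = sqlc.createDataFrame(rows)
df.printSchema()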


1 Answer

This error usually occurs when you try to read an empty directory as Parquet. Probably your outcome DataFrame is empty.

You could check if the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
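A sketch of that guard, using the asker's own variable names (Python 2, matching the question's code):

# Only write (and later read back) if there is actually data; per the
# answer, an empty result leaves a directory with nothing for the
# Parquet reader to infer a schema from.
if outcome.rdd.isEmpty():
    print "No rows for %s; skipping write" % response
else:
    outcome.write.parquet(response, mode="overwrite")
    outcome2 = sqlc.read.parquet(response)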

Javier Montón answered Sep 17 '22