response = "mi_or_chd_5" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") # Success print outcome.schema StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
But then:
outcome2 = sqlc.read.parquet(response) # fail
fails with:
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
in
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?
Using Spark 2.1.1. Also fails in 2.2.0.
Found this bug report, but it was fixed in 2.0.1 and 2.1.0.
UPDATE: This works when connected with master="local", and fails when connected to master="mysparkcluster".
Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
Overall, Parquet's columnar storage, together with its embedded schema and typed data, makes it well suited to analytical workloads.
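To make that schema preservation concrete, here is a minimal round-trip sketch, assuming the same sqlc handle as in the question (the sample rows and the /tmp/roundtrip_demo path are hypothetical): the schema travels in the Parquet file footer, so nothing has to be specified on read.

df = sqlc.createDataFrame(
    [(1, 0), (2, 1)],        # hypothetical sample rows
    ["eid", "response"])     # column names
df.write.parquet("/tmp/roundtrip_demo", mode="overwrite")

df2 = sqlc.read.parquet("/tmp/roundtrip_demo")
df2.printSchema()            # same names and types as df, recovered from the footer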
Inferring the Schema Using Reflection

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table: the names of the case class arguments are read using reflection and become the names of the columns.
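The passage above describes the Scala interface; as a rough PySpark analogue (assuming the question's sqlc plus a SparkContext named sc), reflection works over Row objects instead of case classes:

from pyspark.sql import Row

rdd = sc.parallelize([Row(eid=1, response=0), Row(eid=2, response=1)])
df = sqlc.createDataFrame(rdd)   # column names and types inferred from the Row fields
df.printSchema()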
This error usually occurs when you try to read an empty directory as Parquet, so your outcome DataFrame is probably empty. You can check whether the DataFrame is empty with outcome.rdd.isEmpty() before writing it, as in the sketch below.
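A defensive sketch of that check, reusing the question's outcome and response names (the guard itself is an assumption, not something the original code does): only write, and later read back, when the DataFrame actually has rows, since an empty write can leave a directory with no part files for sqlc.read.parquet to infer a schema from.

if outcome.rdd.isEmpty():
    print("outcome is empty; skipping the parquet write")
else:
    outcome.write.parquet(response, mode="overwrite")
    outcome2 = sqlc.read.parquet(response)   # safe: the directory now contains typed part files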