Unable to infer schema for Parquet. It must be specified manually

I am running all the code from within EMR Notebooks.

spark.version

'3.0.1-amzn-0'

temp_df.printSchema()

root
 |-- dt: string (nullable = true)
 |-- AverageTemperature: double (nullable = true)
 |-- AverageTemperatureUncertainty: double (nullable = true)
 |-- State: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- weekday: integer (nullable = true)

temp_df.show(2)

+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|        dt|AverageTemperature|AverageTemperatureUncertainty|State|Country|year|month|day|weekday|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
|1855-05-01|            25.544|                        1.171| Acre| Brazil|1855|    5|  1|      3|
|1855-06-01|            24.228|                        1.103| Acre| Brazil|1855|    6|  1|      6|
+----------+------------------+-----------------------------+-----+-------+----+-----+---+-------+
only showing top 2 rows

temp_df.write.parquet(path='s3://project7878/clean_data/temperatures.parquet', mode='overwrite', partitionBy=['year'])

spark.read.parquet(path='s3://project7878/clean_data/temperatures.parquet').show(2)

An error was encountered:
Unable to infer schema for Parquet. It must be specified manually.;
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 353, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

I have checked other Stack Overflow posts, but the cause described there (empty files being written) does not apply in my case.

Please help me out. Thank you!

asked Oct 19 '25 14:10 by vjp


1 Answer

Don't pass the location as a path= keyword argument to read.parquet; pass it as a positional argument instead:

>>> spark.read.parquet(path='a.parquet')
21/01/02 22:53:38 WARN DataSource: All paths were ignored:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home//bin/spark/python/pyspark/sql/readwriter.py", line 353, in parquet
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
  File "/home//bin/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/home//bin/spark/python/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
>>> spark.read.parquet('a.parquet')
DataFrame[_2: string, _1: double]

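This is because parquet() has no path parameter. Its signature is parquet(*paths, **options), so path='a.parquet' is collected as an option rather than as a path, and the reader is called with no paths at all. That is consistent with the "All paths were ignored" warning above, which lists no paths.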
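The effect of that signature can be reproduced without Spark. The function below is a hypothetical stand-in (not PySpark code) that only copies the shape parquet(*paths, **options), as a rough sketch of where each kind of argument ends up:

```python
def parquet(*paths, **options):
    """Hypothetical stand-in with the same parameter shape as
    DataFrameReader.parquet(*paths, **options). It only reports
    which arguments were received as paths and which as options."""
    return list(paths), dict(options)


# Positional argument: the location is received as a path.
positional = parquet('a.parquet')

# Keyword argument: the location lands in options, and paths stays empty.
keyword = parquet(path='a.parquet')

print(positional)
print(keyword)
```

With no paths to scan, the real reader has no files from which to infer a schema, which would explain the AnalysisException even though the data on disk is fine.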

The path keyword is valid if you use load, which does have a path parameter:

>>> spark.read.load(path='a', format='parquet')
DataFrame[_1: string, _2: string]
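Applied to the question, either form should work. The two helpers below are a minimal sketch; their names are made up for illustration, and they assume spark is an existing SparkSession such as the one an EMR Notebook provides:

```python
def read_parquet_positional(spark, location):
    """Read Parquet by passing the location as a positional argument.

    Hypothetical helper; `spark` is assumed to be a SparkSession.
    """
    return spark.read.parquet(location)


def read_parquet_via_load(spark, location):
    """Read Parquet through load(), which accepts path as a keyword.

    Hypothetical helper; `spark` is assumed to be a SparkSession.
    """
    return spark.read.load(path=location, format='parquet')
```

In the notebook from the question this would be called as read_parquet_positional(spark, 's3://project7878/clean_data/temperatures.parquet').show(2).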
answered Oct 22 '25 06:10 by VCLL