I am trying to run SQL queries with the spark.sql() or sqlContext.sql() method (here spark is the SparkSession object that is available when an EMR Notebook starts) on a public dataset. The notebook is attached to an EMR cluster with Hadoop, Spark, and Livy installed. Any basic SQL query fails with the error:
AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
I want to use SQL queries, so switching to the DataFrame API is not an acceptable alternative.
This Spark EMR cluster does not have a separate Hive component installed, and I do not intend to use one. I have looked into possible causes of this issue; one is that the EMR Notebook may not have write permission to create the metastore_db directory, but I could not confirm this. I have also tried to find this error in the cluster's log files, but could not locate it, and I am not sure which file would contain it so that I can get more details.
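For finding the error in the cluster logs, a hedged sketch of where to look on the master node (these paths are assumptions based on typical EMR layouts and may vary by release; the `<application_id>` placeholder must be filled in from the YARN ResourceManager UI or `yarn application -list`):

```shell
# Notebook cells run through Livy, so the Livy session logs and the YARN
# application logs for the Spark session are the most likely locations.
sudo grep -ri "SessionHiveMetaStoreClient" /var/log/livy/ /var/log/spark/ 2>/dev/null

# The full driver-side stack trace usually ends up in the YARN logs:
yarn logs -applicationId <application_id> | grep -i -A 5 "metastore"
```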
Steps to reproduce the problem:
Create an AWS EMR cluster from the console using the Quick Start view and select the Spark option. This includes Spark 2.4.3 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.1. The cluster can have 1 master and 2 core nodes, or just a single master node.
Create an EMR Notebook from the Notebooks link on the EMR page, attach it to the cluster you just created, and open it (by default the kernel is PySpark, as shown at the top right of the notebook).
# Importing data from s3
input_bucket = 's3://amazon-reviews-pds'
input_path = '/parquet/product_category=Books/*.parquet'
df = spark.read.parquet(input_bucket + input_path)
# Register temporary view
df.createOrReplaceTempView("reviews")
sqlDF = sqlContext.sql("""SELECT product_id FROM reviews LIMIT 5""")
I expect 5 product_id values from this dataset to be returned; instead I get the error:
u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 358, in sql
return self.sparkSession.sql(sqlQuery)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
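One possible workaround, if no Hive metastore is needed at all (a sketch I have not verified on EMR): start the session with Spark's in-memory catalog instead of the Hive-backed one, so Spark never tries to instantiate SessionHiveMetaStoreClient. Temporary views and spark.sql()/sqlContext.sql() work against the in-memory catalog. In an EMR Notebook this could be set with the Sparkmagic %%configure magic, run as the first cell before any Spark code:

```
%%configure -f
{ "conf": { "spark.sql.catalogImplementation": "in-memory" } }
```

Note that the -f flag forcibly restarts the Livy session, so any previously created views are lost; persistent tables (CREATE TABLE) would still require a metastore.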
I had the same problem and I realized that I didn't have Hive on my EMR cluster.
After launching another cluster and making sure that Hive was selected, it worked.
The notebook should run on an EMR cluster that has a compatible Hive version installed.
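For a scriptable version of the fix above, a hedged sketch using the AWS CLI (the cluster name, release label, instance type, and count are illustrative assumptions; adjust for your account, and note that default EMR roles and a configured region are assumed to exist):

```shell
# Launch an EMR cluster that includes Hive alongside Spark and Livy,
# so Spark SQL can instantiate its Hive metastore client.
aws emr create-cluster \
  --name "spark-with-hive" \
  --release-label emr-5.25.0 \
  --applications Name=Hadoop Name=Spark Name=Livy Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles
```

Equivalently, in the console's Advanced Options view, tick the Hive checkbox in the software list before launching.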