
How to fix error on pyspark EMR Notebook - AnalysisException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I am trying to run SQL queries using the spark.sql() or sqlContext.sql() method (here, spark is the SparkSession object available to us when the EMR Notebook starts) on a public dataset, using an EMR notebook attached to an EMR cluster that has Hadoop, Spark, and Livy installed. Any basic SQL query fails with the error:

AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;

I want to use SQL queries, so I do not want to fall back to the DataFrame API as an alternative.

This EMR cluster does not have a separate Hive component installed, and I don't intend to use Hive. I have looked into various causes of this issue; one possibility is that the EMR notebook lacks write permission to create the metastore_db directory, but I could not confirm this. I also searched the log files on the cluster for this error but could not find it, and I am not sure which file would contain it in order to get more details.

Steps to reproduce the problem:

  1. Create an AWS EMR cluster from the console using the quick start view and select the Spark option. This includes Spark 2.4.3 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.1. The cluster can have 1 master and 2 core nodes, or just a single master node.

  2. Create an EMR Notebook from the Notebooks link on the EMR page, attach it to the cluster you just created, and open it (by default the kernel is PySpark, as shown at the top right of the notebook).

  3. The code I am using runs a spark.sql query on the Amazon reviews dataset, which is public.
  4. Code:
# Importing data from s3
input_bucket = 's3://amazon-reviews-pds'
input_path = '/parquet/product_category=Books/*.parquet'
df = spark.read.parquet(input_bucket + input_path)
# Register temporary view
df.createOrReplaceTempView("reviews")
# Run a SQL query against the temp view (this is where the error occurs)
sqlDF = sqlContext.sql("""SELECT product_id FROM reviews LIMIT 5""")

I expect 5 product_id values from this dataset to be returned; instead I get the error:

u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 358, in sql
    return self.sparkSession.sql(sqlQuery)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
asked Sep 04 '19 by user10259140



2 Answers

I had the same problem and realized that I didn't have Hive installed on my EMR cluster.

After launching another cluster and making sure that Hive was selected, it worked.
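For reference, a cluster that includes Hive can also be created from the AWS CLI. This is only a sketch: the cluster name, release label, instance types, and key pair below are placeholders to adapt to your account (EMR 5.26.0 is one of the 5.2x releases that ship Spark 2.4.x).

```shell
# Create an EMR cluster that includes Hive alongside Spark and Livy,
# so the notebook's SparkSession can reach a Hive metastore.
aws emr create-cluster \
  --name "spark-with-hive" \
  --release-label emr-5.26.0 \
  --applications Name=Hadoop Name=Spark Name=Livy Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair  # placeholder key pair
```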

answered Nov 01 '22 by Edison Gustavo Muenz


The notebook should run on an EMR cluster that has a compatible Hive version installed.

answered Nov 01 '22 by ankursingh1000