I am trying to run SQL queries with the spark.sql() or sqlContext.sql() method (here spark is the SparkSession object that is available when an EMR Notebook starts) on a public dataset. The notebook is attached to an EMR cluster with Hadoop, Spark, and Livy installed. Any basic SQL query fails with the error:
AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
I want to use SQL queries, so switching to the DataFrame API is not an acceptable alternative.
This Spark EMR cluster does not have a separate Hive component installed, and I do not intend to use one. I have looked into possible causes of this issue; one is that the EMR Notebook may not have write permission to create the metastore_db directory, but I could not confirm this. I have also tried to find this error in the cluster's log files, but could not locate it, and I am not sure which file would contain it so that I can get more details.
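For finding the error in the cluster logs, a hedged sketch of where to look on the master node (these paths are assumptions based on typical EMR layouts and may vary by release; the `<application_id>` placeholder must be filled in from the YARN ResourceManager UI or `yarn application -list`):

```shell
# Notebook cells run through Livy, so the Livy session logs and the YARN
# application logs for the Spark session are the most likely locations.
sudo grep -ri "SessionHiveMetaStoreClient" /var/log/livy/ /var/log/spark/ 2>/dev/null

# The full driver-side stack trace usually ends up in the YARN logs:
yarn logs -applicationId <application_id> | grep -i -A 5 "metastore"
```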
Steps to reproduce the problem:
Create an AWS EMR cluster from the console using the Quick Start view and select the Spark option. This includes Spark 2.4.3 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.1. The cluster can have 1 master and 2 core nodes, or just a single master node.
Create an EMR Notebook from the Notebooks link on the EMR page, attach it to the cluster you just created, and open it (by default the kernel is PySpark, as shown at the top right of the notebook).
# Importing data from s3
input_bucket = 's3://amazon-reviews-pds'
input_path = '/parquet/product_category=Books/*.parquet'
df = spark.read.parquet(input_bucket + input_path)
# Register temporary view
df.createOrReplaceTempView("reviews")
sqlDF = sqlContext.sql("""SELECT product_id FROM reviews LIMIT 5""")
I expect 5 product_id values from this dataset to be returned; instead I get the error:
u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 358, in sql
return self.sparkSession.sql(sqlQuery)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
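One possible workaround, if no Hive metastore is needed at all (a sketch I have not verified on EMR): start the session with Spark's in-memory catalog instead of the Hive-backed one, so Spark never tries to instantiate SessionHiveMetaStoreClient. Temporary views and spark.sql()/sqlContext.sql() work against the in-memory catalog. In an EMR Notebook this could be set with the Sparkmagic %%configure magic, run as the first cell before any Spark code:

```
%%configure -f
{ "conf": { "spark.sql.catalogImplementation": "in-memory" } }
```

Note that the -f flag forcibly restarts the Livy session, so any previously created views are lost; persistent tables (CREATE TABLE) would still require a metastore.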
I had the same problem and I realized that I didn't have Hive on my EMR cluster.
After launching another cluster and making sure that Hive was selected, it worked.
The notebook should run on an EMR cluster that has a compatible Hive version installed.
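For a scriptable version of the fix above, a hedged sketch using the AWS CLI (the cluster name, release label, instance type, and count are illustrative assumptions; adjust for your account, and note that default EMR roles and a configured region are assumed to exist):

```shell
# Launch an EMR cluster that includes Hive alongside Spark and Livy,
# so Spark SQL can instantiate its Hive metastore client.
aws emr create-cluster \
  --name "spark-with-hive" \
  --release-label emr-5.25.0 \
  --applications Name=Hadoop Name=Spark Name=Livy Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles
```

Equivalently, in the console's Advanced Options view, tick the Hive checkbox in the software list before launching.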