 

Spark SQL using Python: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I want to test some basic Spark SQL functionality: load a CSV file saved on my laptop and run a few SQL queries on it. But somehow I cannot load the data using sqlContext. I get the error:

Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient. 

I am not using Hive, however.

I am using Windows 10 and installed Python via Anaconda. I installed Spark 2.0.2, prebuilt for Hadoop 2.6, and use an IPython notebook as the user interface.

My code is as follows:

file = "C:/Andra/spark-2.0.2-bin-hadoop2.6/zip.csv"
df = sqlContext\
    .read \
    .format("com.databricks.spark.csv")\
    .option("header", "true")\
    .option("inferschema", "true")\
    .option("mode", "DROPMALFORMED")\
    .load(file)
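
As an aside, Spark 2.0 already ships a built-in CSV reader, so the external com.databricks.spark.csv package should not be needed. A minimal sketch, assuming the SparkSession spark from the documentation example further below (in my setup it still triggers the same error, but it removes the external dependency):

# Sketch using the CSV reader built into Spark 2.0+; assumes a SparkSession
# named `spark` exists (as in the documentation example below).
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("mode", "DROPMALFORMED") \
    .csv("C:/Andra/spark-2.0.2-bin-hadoop2.6/zip.csv")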

The problem seems to lie in Spark SQL itself, since I can load the same file using

textFile=sc.textFile("C:/Andra/spark-2.0.2-bin-hadoop2.6/zip.csv")

If I run an example from the Spark SQL documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html), I get the same error:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
df = spark.read.json("C:/Andra/spark-2.0.2-bin-hadoop2.6/examples/src/main/resources/people.json")

I was under the impression that I could use Spark SQL without Hive, since the data I am using is saved locally on my laptop. Furthermore, the same documentation implies just that:

"One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section."

There are also separate examples of creating a SparkSession with Hive support, so the plain one above would be pointless if using Hive were mandatory.
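
For concreteness, here is a sketch of what I would expect a Hive-free session to look like. As far as I can tell, spark.sql.catalogImplementation is an internal Spark setting rather than a documented option, and it only takes effect before the first SparkSession in the JVM is created:

from pyspark.sql import SparkSession

# Sketch: force the in-memory catalog so that no Hive metastore client is
# instantiated. spark.sql.catalogImplementation is an internal setting and
# must be set before the first SparkSession exists in the JVM.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL without Hive") \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .getOrCreate()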

Nevertheless, I wanted to configure Hive to see whether that solves the problem. The documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables) states:

"Configuration of Hive is done by placing your hive-site.xml , core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/."

I could not find those files, though.
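
From what I gather, these files do not ship with Spark; one would create them in the conf/ directory of the Spark installation (here C:/Andra/spark-2.0.2-bin-hadoop2.6/conf). A minimal hive-site.xml might look like the sketch below — the property name is a standard Hive setting, but the Derby path is just a placeholder:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=C:/Andra/metastore_db;create=true</value>
  </property>
</configuration>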

So my questions are these:

  • Do I need Hive to use Spark SQL?
  • If not, what can I do to get Spark SQL working?
  • If yes, how can I configure it correctly, and where can I find the files needed?

Any help is appreciated! Thank you!

Here is the complete error statement:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-4-e50d7a8fb32b> in <module>()
      1 file = "C:/Andra/spark-2.0.2-bin-hadoop2.6/zip.csv"
----> 2 df = sqlContext    .read     .format("com.databricks.spark.csv")    .option("header", "true")    .option("inferschema", "true")    .option("mode", "DROPMALFORMED")    .load(file)

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\pyspark\sql\readwriter.pyc in load(self, path, format, schema, **options)
    145         self.options(**options)
    146         if isinstance(path, basestring):
--> 147             return self._df(self._jreader.load(path))
    148         elif path is not None:
    149             if type(path) != list:

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\lib\py4j-0.10.3-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\pyspark\sql\utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\lib\py4j-0.10.3-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o110.load.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:189)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
    at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
    at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
    at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
    at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
    at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
    at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
    at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
    at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
    at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
    at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
    ... 33 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
    ... 39 more
Caused by: java.lang.NullPointerException
    at org.apache.thrift.transport.TSocket.open(TSocket.java:170)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
    ... 44 more
asked Feb 05 '23 by pluecky


1 Answer

I recently ran into the same problem. In my case I was running two Python Jupyter notebooks on my local computer at the same time. The first notebook worked fine, but the second one kept throwing the dreaded

Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I am not sure exactly how the permissions work, but it seems the first notebook to run somehow locks the local metastore. That makes sense: the metastore cannot be shared between two different sessions.

Maybe someone knows how to enable multiple notebooks?
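
I have not verified this, but one workaround that might help: the embedded Derby metastore creates a metastore_db directory (and a lock file) in the JVM's working directory, so giving each notebook its own working directory before the SparkSession is created should keep the two metastores from colliding. A rough sketch — the path is a placeholder:

import os

# Untested sketch: change the working directory *before* the SparkSession
# (and hence the JVM) is created, so this notebook's embedded Derby
# metastore_db lands somewhere private to it.
os.chdir("C:/tmp/notebook-2")

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("notebook-2") \
    .getOrCreate()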

Andy

answered May 01 '23 by AEDWIP