
Why does pyspark fail with "Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars."?

I am using a standalone Apache Spark 2.0.0 cluster with two nodes, and I have not installed Hive. I get the following error when creating a DataFrame.

from pyspark import SparkContext
from pyspark.sql import SQLContext  # SQLContext lives in pyspark.sql, not pyspark

sqlContext = SQLContext(sc)  # sc is the SparkContext provided by the pyspark shell
l = [('Alice', 1)]
sqlContext.createDataFrame(l).collect()
---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-9-63bc4f21f23e> in <module>()
----> 1 sqlContext.createDataFrame(l).collect()

/home/mok/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.pyc in createDataFrame(self, data, schema, samplingRatio)
    297         Py4JJavaError: ...
    298         """
--> 299         return self.sparkSession.createDataFrame(data, schema, samplingRatio)
    300 
    301     @since(1.3)

/home/mok/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.pyc in createDataFrame(self, data, schema, samplingRatio)
    522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
--> 524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
    525         df = DataFrame(jdf, self._wrapped)
    526         df._schema = schema

/home/mok/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    931         answer = self.gateway_client.send_command(command)
    932         return_value = get_return_value(
--> 933             answer, self.gateway_client, self.target_id, self.name)
    934 
    935         for temp_arg in temp_args:

/home/mok/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: u'Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars.'

So should I install Hive, or just edit the configuration?

asked Aug 27 '16 by naveed mohad abdul

People also ask

Does spark need Hive Metastore?

Spark SQL does not require a Hive metastore under the covers; it defaults to an in-memory, non-Hive catalog (except in spark-shell, which defaults to the Hive catalog). The default external catalog implementation is controlled by a Spark configuration property.
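In Spark 2.x the property in question is `spark.sql.catalogImplementation`, which accepts `in-memory` or `hive`. As a minimal sketch, the in-memory (non-Hive) catalog can be requested in `conf/spark-defaults.conf`, or equivalently with `--conf` on the `spark-submit` command line:

```
# conf/spark-defaults.conf -- request the in-memory (non-Hive) catalog
spark.sql.catalogImplementation  in-memory
```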

What is Pyspark HiveContext?

Class HiveContext: a variant of Spark SQL that integrates with data stored in Hive. Configuration for Hive is read from hive-site.xml on the classpath. It supports running both SQL and HiveQL commands.


2 Answers

If you have several Java versions installed, you'll have to figure out which one Spark is using. I did this by trial and error, starting with

JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"

and ending with

JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
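Rather than pure trial and error, a small helper can pick a Java 8 home from a list of installed JVM directories. This is only an illustrative sketch: the `pick_java8` function and the candidate paths are assumptions for this example, not part of Spark or the original answer.

```python
def pick_java8(candidates):
    """Return the first candidate JVM home that looks like Java 8, else None."""
    for path in candidates:
        # OpenJDK 8 directories are typically named 'java-8-...' or contain '1.8'
        if "java-8" in path or "1.8" in path:
            return path
    return None

# Example with the two JVM homes tried above:
jvms = ["/usr/lib/jvm/java-11-openjdk-amd64",
        "/usr/lib/jvm/java-8-openjdk-amd64"]
print(pick_java8(jvms))  # -> /usr/lib/jvm/java-8-openjdk-amd64
```

The returned path can then be exported as JAVA_HOME before launching pyspark.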
answered Oct 01 '22 by jeremy_rutman


IllegalArgumentException: u'Unable to locate hive jars to connect to metastore. Please set spark.sql.hive.metastore.jars.'

I had the same issue and fixed it by using Java 8. Make sure you install JDK 8 and set the environment variables accordingly.

Do not use Java 11 with Spark / pyspark 2.4.
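To verify which major version a Java runtime reports, note that Java 8 identifies itself as "1.8.x" while Java 9 and later use "11.x"-style strings. The `parse_java_version` helper below is illustrative, not part of pyspark:

```python
def parse_java_version(version_string):
    """Return the major Java version, e.g. '1.8.0_292' -> 8, '11.0.2' -> 11."""
    parts = version_string.split(".")
    major = int(parts[0])
    if major == 1:          # pre-Java-9 scheme: 1.<major>.<minor>
        major = int(parts[1])
    return major

# Spark / pyspark 2.4 and earlier want Java 8:
print(parse_java_version("1.8.0_292"))  # -> 8
print(parse_java_version("11.0.2"))     # -> 11
```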

answered Oct 01 '22 by ashwin kumar