Multiple Spark applications with HiveContext

Tags:

apache-spark

pyspark

hive

When two separate PySpark applications each instantiate a HiveContext instead of a SQLContext, one of the two fails with the error:

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))

The other application terminates successfully.

I am using Spark 1.6 from the Python API and want to use some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.

This is enough to reproduce:

import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)

data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)
time.sleep(60)

The sleep is just to keep the script running while I start the other process.

If I have two instances of this script running, the above error shows up when reading the Parquet file. When I replace HiveContext with SQLContext, everything's fine.

Does anyone know why that is?

259

asked Jan 10 '16 13:01

karlson

1 Answer

By default, Hive (and therefore HiveContext) uses an embedded Derby database as its metastore. Embedded Derby is intended mostly for testing and supports only one active user at a time, which is why the second application fails. If you want to support multiple running applications, you should configure a standalone metastore. At the time of writing, Hive supports PostgreSQL, MySQL, Oracle and MS SQL Server. Details of the configuration depend on the backend and the mode (local / remote), but generally speaking you'll need:

a running RDBMS server
a metastore database created using provided scripts
a proper Hive configuration

Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.
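As a rough illustration of the "proper Hive configuration" item, here is a minimal sketch that generates a hive-site.xml pointing the metastore at an external database. The four `javax.jdo.option.*` property names are standard Hive settings; everything else (the helper name `build_hive_site`, the host, the database name, the user and the password) is a placeholder assumed for this example, and a real setup will likely need more than this.

```python
import xml.etree.ElementTree as ET


def build_hive_site(properties):
    """Return hive-site.xml content (as text) for a dict of Hive properties.

    `build_hive_site` is a hypothetical helper written for this example,
    not part of Spark or Hive.
    """
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="utf-8").decode("utf-8")


# Placeholder connection details (assumptions): replace with your own server.
metastore_properties = {
    "javax.jdo.option.ConnectionURL": "jdbc:postgresql://localhost:5432/metastore",
    "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "change-me",
}

hive_site_xml = build_hive_site(metastore_properties)
print(hive_site_xml)
```

The generated file would go into Spark's conf/ directory, and the matching JDBC driver jar has to be on the classpath of every application that creates a HiveContext.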

Theoretically, it should also be possible to create separate Derby metastores with a proper configuration (see Hive Admin Manual - Local/Embedded Metastore Database) or to use Derby in Server Mode.

For development, you can start the applications in different working directories. This will create a separate metastore_db for each application and avoid the issue of multiple active users. Providing a separate Hive configuration for each application should work as well, but is less useful in development:

When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory
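The working-directory workaround can be sketched as a small launcher that starts each application in its own freshly created directory. This is only an illustration: `run_in_fresh_workdir` is a hypothetical helper, and the `spark-submit` invocation in the comment assumes that `spark-submit` is on the PATH and that the script path is a placeholder.

```python
import os
import subprocess
import tempfile


def run_in_fresh_workdir(command, base_dir=None):
    """Run `command` in a newly created, private working directory.

    Returns a (return_code, workdir) tuple. Because each call gets its own
    directory, each application creates its own metastore_db there.
    """
    workdir = tempfile.mkdtemp(prefix="spark_app_", dir=base_dir)
    return_code = subprocess.call(command, cwd=workdir)
    return return_code, workdir


# Hypothetical usage (not executed here). Use an absolute script path,
# since the child process no longer runs in the directory you started from:
#
#   run_in_fresh_workdir(["spark-submit", "/absolute/path/to/app.py"])
```

The launcher changes the directory of the whole child process rather than calling os.chdir inside the PySpark script, because with spark-submit the JVM that creates metastore_db is already running by the time the Python code executes.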

109

answered Oct 23 '22 08:10

zero323
