Multiple Spark applications with HiveContext

When I run two separate pyspark applications that instantiate a HiveContext in place of a SQLContext, one of the two applications fails with the error:

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))

The other application terminates successfully.

I am using Spark 1.6 from the Python API and want to make use of some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.

This is enough to reproduce:

import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)

data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)  # this is where the second application fails
time.sleep(60)  # keep the application alive while the second one is started

The sleep is just to keep the script running while I start the other process.

If I have two instances of this script running, the above error appears when reading the parquet file. When I replace HiveContext with SQLContext, everything works fine.

Does anyone know why that is?

asked Jan 10 '16 by karlson

People also ask

Can a Spark application have multiple Spark sessions?

Spark applications can use multiple sessions, for example to work with different underlying data catalogs. You can create a new session from an existing Spark session by calling the newSession method.
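A minimal pyspark sketch, assuming Spark 1.6, where SQLContext/HiveContext expose newSession (in Spark 2.x you would call newSession on a SparkSession instead):

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

sc = SparkContext(conf=SparkConf())
sq = HiveContext(sc)

# A second session: separate SQLConf, UDFs and temporary tables,
# but the same SparkContext and table cache.
sq2 = sq.newSession()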

Can we have multiple Spark contexts?

Note: you can have multiple Spark contexts by setting spark.driver.allowMultipleContexts to true. However, having multiple Spark contexts in the same JVM is not encouraged and is not considered good practice, as it makes the application less stable and a crash of one Spark context can affect the others.

Is it possible to have multiple SparkContext in single JVM?

Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one. The config parameter is a SparkConf object describing the application configuration; any settings in it override the default configs as well as system properties.

How many SparkContext can be created?

You can only have one SparkContext at a time. You can start and stop it on demand as many times as you want, but I remember an issue that said you should not close the SparkContext unless you're done with Spark (which usually happens at the very end of your Spark application).
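A minimal pyspark sketch of stopping and re-creating the context (the application names are made up):

from pyspark import SparkContext, SparkConf

sc = SparkContext(conf=SparkConf().setAppName("first-job"))
# ... do some work ...
sc.stop()  # release the active context before creating a new one

sc = SparkContext(conf=SparkConf().setAppName("second-job"))
# ... more work ...
sc.stop()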


1 Answer

By default, Hive(Context) uses embedded Derby as its metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications, you should configure a standalone metastore. At the moment Hive supports PostgreSQL, MySQL, Oracle and MS SQL Server. Details of the configuration depend on the backend and the mode (local / remote), but generally speaking you'll need:

  • a running RDBMS server
  • a metastore database created using provided scripts
  • a proper Hive configuration (see the sketch below)
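As an illustration only (host, database name and credentials below are placeholders, not values from the question), a hive-site.xml for a MySQL-backed metastore could look roughly like this:

<configuration>
  <!-- JDBC connection to the standalone metastore database -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
  <!-- For a remote metastore, clients instead point at the metastore service -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>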

Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.

Theoretically it should also be possible to create separate Derby metastores with a proper configuration (see Hive Admin Manual - Local/Embedded Metastore Database) or to use Derby in Server Mode.

For development you can start the applications in different working directories. This will create a separate metastore_db for each application and avoid the issue of multiple active users. Providing a separate Hive configuration should work as well but is less useful in development:

When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory
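A minimal sketch of the working-directory approach, assuming Spark 1.6 and a made-up /tmp/app1 path (each application would use its own directory):

import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

workdir = '/tmp/app1'  # the second application would use e.g. /tmp/app2
if not os.path.isdir(workdir):
    os.makedirs(workdir)
os.chdir(workdir)  # Derby files (metastore_db, derby.log) are created here

sc = SparkContext(conf=SparkConf())
sq = HiveContext(sc)  # each application now gets its own embedded metastore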

answered Oct 23 '22 by zero323