I'm trying to run a notebook on Analytics for Apache Spark running on Bluemix, but I hit the following error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and 
run build/sbt assembly", Py4JJavaError(u'An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))
The error is intermittent - it doesn't always happen. The line of code in question is:
df = sqlContext.read.format('jdbc').options(
            url=url, 
            driver='com.ibm.db2.jcc.DB2Driver', 
            dbtable='SAMPLE.ASSETDATA'
        ).load()
There are a few similar questions on Stack Overflow, but they aren't asking about the Spark service on Bluemix.
That statement initializes a HiveContext under the covers. The HiveContext then initializes a local Derby database to hold its metadata. The Derby database is created in the current directory by default. The reported problem occurs under these circumstances (among others):
Until IBM changes the default setup to avoid this problem, possible workarounds are:
For case 1, delete the leftover lockfiles. From a Python notebook, this is done by executing:
!rm -f ./metastore_db/*.lck
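If you would rather not shell out from the notebook, the same cleanup can be sketched in plain Python. This is only a rough equivalent of the `rm -f` line above; the function name `remove_derby_lockfiles` is made up for illustration, and it assumes the metastore lives in `./metastore_db` relative to the current directory, as described earlier.

```python
import glob
import os


def remove_derby_lockfiles(metastore_dir="./metastore_db"):
    """Delete leftover Derby *.lck files and return the paths that were removed."""
    removed = []
    for path in glob.glob(os.path.join(metastore_dir, "*.lck")):
        try:
            os.remove(path)
            removed.append(path)
        except OSError:
            # Like `rm -f`, silently skip files that vanished or cannot be removed.
            pass
    return removed
```

Calling `remove_derby_lockfiles()` before the statement that triggers the HiveContext returns the list of deleted lockfiles, which can help confirm whether leftover locks were actually present.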
For case 2, change the current working directory before the Hive context is created. In a Python notebook, this will change into a newly created directory:
import os
import tempfile
os.chdir(tempfile.mkdtemp())
But beware: this clutters the filesystem with a new directory and a new Derby database each time you run the notebook.
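One way to limit that clutter might be to register a cleanup that deletes the temporary directory when the Python process exits. The following is a minimal sketch, not something tested on the Bluemix service: the helper name `chdir_to_scratch_dir` and the `hive_scratch_` prefix are invented here, and `atexit` handlers only run on a normal interpreter shutdown, so a killed kernel would still leave the directory behind.

```python
import atexit
import os
import shutil
import tempfile


def chdir_to_scratch_dir():
    """Change into a fresh temporary directory and schedule its removal at exit.

    Returns the scratch directory path and the cleanup function, so the
    cleanup can also be called by hand.
    """
    original_dir = os.getcwd()
    scratch_dir = tempfile.mkdtemp(prefix="hive_scratch_")

    def cleanup():
        # Leave the scratch directory before deleting it.
        if os.path.isdir(original_dir):
            os.chdir(original_dir)
        # ignore_errors: best effort, in case some files cannot be removed.
        shutil.rmtree(scratch_dir, ignore_errors=True)

    atexit.register(cleanup)
    os.chdir(scratch_dir)
    return scratch_dir, cleanup
```

Run `scratch_dir, cleanup = chdir_to_scratch_dir()` in a cell before the Hive context is created. Whether Derby has released all of its files by the time the cleanup runs is an assumption, which is why the removal is best effort.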
I happen to know that IBM is working on a fix. Please use these workarounds only if you encounter the problem, not proactively.