I'm using spark-hive 2.3.0 with Scala 2.11 and setting up a unit test framework. spark-hive comes with TestHiveContext and TestHiveSparkSession, which conveniently allow invoking Hive from unit tests without having Hadoop, Spark, or a cluster running, which is great for automated tests.
Hive needs a database for its metastore, and when run this way it uses Derby as an embedded database, configured via javax.jdo.option.ConnectionURL, which defaults to jdbc:derby:;databaseName=<file-path>;create=true. The <file-path> is a location in the local filesystem, which is one option for running Derby. Another option is running Derby in-memory, which is usually as easy as changing the URL to something like jdbc:derby:memory:databaseName;create=true.
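As a quick sanity check that the in-memory URL itself works (independent of Hive), it can be exercised directly over JDBC; a minimal sketch, assuming the embedded Derby driver that spark-hive pulls in transitively is on the classpath:

import java.sql.DriverManager

// Opening a connection with create=true spins up an in-memory Derby
// database; nothing is written to the local filesystem.
val conn = DriverManager.getConnection("jdbc:derby:memory:scratch;create=true")
val stmt = conn.createStatement()
stmt.execute("CREATE TABLE t (id INT)")
conn.close()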
However, this isn't possible with Hive, because the configuration is made in an internal HiveUtils class and can't be overridden. I've tried changing it in my SparkSession builder, but my change later gets blown away by HiveUtils when I create my TestHiveContext.
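For reference, this is roughly what I tried (a sketch; the builder calls are the standard Spark API, but the setting does not survive TestHiveContext creation):

import org.apache.spark.sql.SparkSession

// Attempt to point the metastore at an in-memory Derby database.
// HiveUtils later overwrites this value when the TestHiveContext
// is constructed, so the override is lost.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("hive-test")
  .config("javax.jdo.option.ConnectionURL",
    "jdbc:derby:memory:metastore;create=true")
  .enableHiveSupport()
  .getOrCreate()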
In my case an in-memory database is preferable because our developers run on Windows (definitely not my/our choice). When the on-disk files are created there are often problems such as permissions or invalid characters in the filename (Hadoop was never really intended to work on Windows), and the files often get left behind because those same problems prevent them from being cleaned up. We would like the tests to be completely self-contained, running and finishing with no side effects, so they can be run in multiple environments (developer, CI, Jenkins, AWS, etc.).
Interestingly, I see this in TestHive.scala:
{ // set the metastore temporary configuration
  val metastoreTempConf = HiveUtils.newTemporaryConfiguration(useInMemoryDerby = false) ++ Map(
So there is a flag for using an in-memory database, but it is not configurable and there is no code path where it is set to true.
Is there any way to configure or write this so that TestHive's Derby can be in-memory? Setting javax.jdo.option.ConnectionURL through either hive-site.xml or hdfs-site.xml does not work; I think that's because TestHive, TestHiveContext, and TestHiveSparkSession are initialized through their own code paths, separate from the non-test paths. The functionality they provide is very helpful for a test framework, but apparently there is no way to override this value and some other settings.
The best option I can see so far is to write my own TestHiveContext class that borrows a bunch of functionality from the original and overrides the parts I need, but that's a relatively large undertaking for what I think could be done with a simple configuration change.
I finally figured out how to do this, and wanted to share the answer in case someone else is trying to do the same thing.
My test class uses the SharedSparkContext trait, which provides a SparkContext reference via var sc.
After the SparkContext is initialized (I used the beforeAll hook available in ScalaTest), I create a TestHiveContext like this:
hc = new TestHiveContext(sc, false) // second argument is loadTestTables
And then immediately afterwards, I can set javax.jdo.option.ConnectionURL (and presumably other Hadoop and Hive configurations) like this:
sc.hadoopConfiguration.set("javax.jdo.option.ConnectionURL",
"jdbc:derby:memory:db;create=true")
This config param is used by Hive, but apparently has to be added to the Hadoop configuration, which is used to build the Hive test context.
The trick is the timing: this has to be done after Hadoop and Hive have initialized themselves (using config files and so on), after the ScalaTest framework has initialized, and finally after the TestHive framework is initialized, but before you have run any tests. Trying to set this parameter before these other initializations means your setting will be overwritten before your tests run.
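Putting it all together, here is a minimal sketch of the whole arrangement. It assumes SharedSparkContext comes from the spark-testing-base library and uses ScalaTest's FunSuite; the suite and table names are hypothetical:

import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.sql.hive.test.TestHiveContext
import org.scalatest.FunSuite

class HiveMetastoreSpec extends FunSuite with SharedSparkContext {

  @transient private var hc: TestHiveContext = _

  override def beforeAll(): Unit = {
    super.beforeAll() // SharedSparkContext initializes sc here
    // Let TestHive finish its own initialization first...
    hc = new TestHiveContext(sc, false)
    // ...then override the metastore URL so Derby runs purely in memory
    // and leaves no files behind on disk.
    sc.hadoopConfiguration.set("javax.jdo.option.ConnectionURL",
      "jdbc:derby:memory:db;create=true")
  }

  test("the metastore works with no on-disk side effects") {
    hc.sql("CREATE TABLE IF NOT EXISTS demo (id INT)")
    assert(hc.sql("SHOW TABLES").count() > 0)
  }
}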