
How to use in-memory Derby database for testing with Hive (Scala)

I'm using spark-hive 2.3.0 with Scala 2.11 and setting up a unit test framework. spark-hive comes with TestHiveContext and TestHiveSparkSession which conveniently allow invoking Hive from unit tests without having Hadoop, Spark, or a cluster running, which is great for automated tests.

Hive needs a database for its metastore, and when run in this way it uses Derby as an embedded database configured with javax.jdo.option.ConnectionURL which by default is jdbc:derby:;databaseName=<file-path>;create=true. The <file-path> is a location in the local filesystem and is one option for running Derby.

Another option is running Derby in-memory, which is usually as easy as changing this URL to something like jdbc:derby:memory:databaseName;create=true. However, this isn't possible with Hive because the configuration is made in an internal HiveUtils class and can't be overridden. I've tried changing it in my Spark Session Builder, but my change later gets blown away by HiveUtils when I create my TestHiveContext.
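For illustration, this is the kind of override that gets clobbered (a sketch using the standard SparkSession builder API; the config key and in-memory URL are the ones discussed above, the app name is made up):

```scala
import org.apache.spark.sql.SparkSession

// Attempt to point the metastore at an in-memory Derby database.
// HiveUtils later replaces this setting when TestHiveContext is
// created, so the override does not stick.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("metastore-test")
  .config("javax.jdo.option.ConnectionURL",
          "jdbc:derby:memory:metastore_db;create=true")
  .enableHiveSupport()
  .getOrCreate()
```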

In my case an in-memory database is preferable because our developers run on Windows (definitely not my/our choice). When these files are created there are often problems such as permission errors or invalid characters in the filename (since Hadoop was never really intended to work on Windows), and the files often get left behind because they can't be cleaned up (due to these same problems). We would like the tests to be completely self-contained, running and finishing with no side effects, so they can be run in multiple environments (developer machines, CI, Jenkins, AWS, etc).

Interestingly I see this in TestHive.scala:

{ // set the metastore temporary configuration
  val metastoreTempConf = HiveUtils.newTemporaryConfiguration(useInMemoryDerby = false) ++ Map(

So there is a flag for using an in-memory database, but this is not configurable and there is no code path where this is set to true.

Is there any way to configure or write this so that TestHive's Derby can be in-memory? Trying to set the value of javax.jdo.option.ConnectionURL through either hive-site.xml or hdfs-site.xml does not work. I think this is because of how TestHive, TestHiveContext and TestHiveSparkSession are initialized: they have their own code paths separate from the non-test paths. The functionality they provide is very helpful for a test framework, but apparently doesn't provide a way to override this value and some other settings.
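For reference, the hive-site.xml attempt that did not work looked like this (the property name is the standard Hive metastore key; the in-memory URL is the one being attempted):

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:metastore_db;create=true</value>
  </property>
</configuration>
```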

The best option I can see so far is to override or write my own TestHiveContext class that borrows a bunch of functionality from that class and overrides the parts that I need, but that's a relatively large undertaking for what I think could be done with a simple configuration change.

Uncle Long Hair asked Apr 06 '18 19:04


People also ask

Is Derby an in memory database?

Derby mainly supports on-disk databases. It also provides an in-memory database for testing and developing applications. By following backup procedures, an in-memory database can be saved and later used as either an in-memory database or a normal on-disk database.

How do I start a hive Derby database?

To configure Hive with Hadoop, edit the hive-env.sh file, which is located in the $HIVE_HOME/conf directory. Once the Hive installation is complete, you need an external database server to configure the Metastore; the Apache Derby database is used by default.

What is Derby in hive?

Derby is an open-source relational database that ships with Hive by default and is maintained by Apache. From an industry perspective, Derby is nowadays used mainly for testing; for production deployments, a MySQL metastore is typically used. Prerequisite: Hadoop should be pre-installed.


1 Answer

I finally figured out how to do this, and wanted to share the answer in case someone else is trying to do the same thing.

My test class uses the SharedSparkContext trait, which provides a SparkContext reference via a var sc.

After the SparkContext is initialized (I used the beforeAll hook available in scalatest test frameworks), I create a TestHiveContext like this:

hc = new TestHiveContext(sc, false)

And then immediately afterwards, I can set the javax.jdo.option.ConnectionURL and presumably some other Hadoop and Hive configurations like this:

sc.hadoopConfiguration.set("javax.jdo.option.ConnectionURL", 
                           "jdbc:derby:memory:db;create=true")

This config param is used by Hive, but apparently has to be added to the Hadoop configuration, which is used to build the Hive test context.

The trick is the timing: this has to be done after Hadoop and Hive have initialized themselves (from config files and whatnot), after the scalatest framework is initialized, and finally after the TestHive framework is initialized, but before any tests have run. Trying to set this parameter before these other initializations means your setting will be overwritten before your tests run.
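Putting the pieces together, the ordering looks like this (a sketch assuming scalatest's FunSuite/BeforeAndAfterAll and the SharedSparkContext trait from spark-testing-base, as in the answer above; the class name is illustrative):

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.sql.hive.test.TestHiveContext
import org.scalatest.FunSuite

class HiveMetastoreSpec extends FunSuite with SharedSparkContext {

  @transient private var hc: TestHiveContext = _

  override def beforeAll(): Unit = {
    super.beforeAll() // SharedSparkContext initializes sc here

    // Create the test Hive context first, so TestHive/HiveUtils
    // finish their own initialization...
    hc = new TestHiveContext(sc, false)

    // ...then override the metastore URL: after TestHive has run,
    // but before any test executes.
    sc.hadoopConfiguration.set("javax.jdo.option.ConnectionURL",
                               "jdbc:derby:memory:db;create=true")
  }
}
```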

Uncle Long Hair answered Sep 20 '22 20:09