
How to use in-memory Derby database for testing with Hive (Scala)

I'm using spark-hive 2.3.0 with Scala 2.11 and setting up a unit test framework. spark-hive comes with TestHiveContext and TestHiveSparkSession which conveniently allow invoking Hive from unit tests without having Hadoop, Spark, or a cluster running, which is great for automated tests.

Hive needs a database for its metastore, and when run in this way it uses Derby as an embedded database configured with javax.jdo.option.ConnectionURL which by default is jdbc:derby:;databaseName=<file-path>;create=true. The <file-path> is a location in the local filesystem and is one option for running Derby.

Another option is running Derby in-memory, which is usually as easy as changing this URL to something like jdbc:derby:memory:databaseName;create=true. However, this isn't possible with Hive because the configuration is made in an internal HiveUtils class and can't be overridden. I've tried changing it in my Spark Session Builder, but my change later gets blown away by HiveUtils when I create my TestHiveContext.
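For illustration, this is the kind of override that gets clobbered (a sketch using the standard SparkSession builder API; the config key and in-memory URL are the ones discussed above, the app name is made up):

```scala
import org.apache.spark.sql.SparkSession

// Attempt to point the metastore at an in-memory Derby database.
// HiveUtils later replaces this setting when TestHiveContext is
// created, so the override does not stick.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("metastore-test")
  .config("javax.jdo.option.ConnectionURL",
          "jdbc:derby:memory:metastore_db;create=true")
  .enableHiveSupport()
  .getOrCreate()
```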

In my case an in-memory database is preferable because our developers run on Windows (definitely not my/our choice). When these files are created there are often problems such as permission errors or invalid characters in the filename (since Hadoop was never really intended to work on Windows), and the files often get left behind because they can't be cleaned up (due to these same problems). We would like the tests to be completely self-contained, running and finishing with no side effects, so they can be run in multiple environments (developer machines, CI, Jenkins, AWS, etc).

Interestingly I see this in TestHive.scala:

{ // set the metastore temporary configuration
  val metastoreTempConf = HiveUtils.newTemporaryConfiguration(useInMemoryDerby = false) ++ Map(

So there is a flag for using an in-memory database, but this is not configurable and there is no code path where this is set to true.

Is there any way to configure or write this so that TestHive's Derby can be in-memory? Trying to set the value of javax.jdo.option.ConnectionURL through either hive-site.xml or hdfs-site.xml does not work. I think this is because of how TestHive, TestHiveContext and TestHiveSparkSession are initialized: they have their own code paths separate from the non-test paths. The functionality they provide is very helpful for a test framework, but apparently doesn't provide a way to override this value and some other settings.
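For reference, the hive-site.xml attempt that did not work looked like this (the property name is the standard Hive metastore key; the in-memory URL is the one being attempted):

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:metastore_db;create=true</value>
  </property>
</configuration>
```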

The best option I can see so far is to override or write my own TestHiveContext class that borrows a bunch of functionality from that class and overrides the parts that I need, but that's a relatively large undertaking for what I think could be done with a simple configuration change.

Uncle Long Hair asked Apr 06 '18 19:04


People also ask

Is Derby an in memory database?

Derby mainly supports on-disk databases. It also provides an in-memory database for testing and developing applications. By following backup procedures, an in-memory database can be saved and later used as either an in-memory database or a normal on-disk database.

How do I start a hive Derby database?

To configure Hive with Hadoop, edit the hive-env.sh file, which is located in the $HIVE_HOME/conf directory. Once the Hive installation is complete, you need an external database server to configure the Metastore; the Apache Derby database is used by default.

What is Derby in hive?

Derby is an open-source relational database that ships with Hive by default and is maintained by Apache. From an industry perspective, Derby is nowadays used mainly for testing; for production deployments, a MySQL metastore is typically used. Prerequisite: Hadoop should be pre-installed.


1 Answer

I finally figured out how to do this, and wanted to share the answer in case someone else is trying to do the same thing.

My test class uses the SharedSparkContext trait, which provides a SparkContext reference via a var sc.

After the SparkContext is initialized (I used the beforeAll hook available in scalatest test frameworks), I create a TestHiveContext like this:

hc = new TestHiveContext(sc, false)

And then immediately afterwards, I can set the javax.jdo.option.ConnectionURL and presumably some other Hadoop and Hive configurations like this:

sc.hadoopConfiguration.set("javax.jdo.option.ConnectionURL", 
                           "jdbc:derby:memory:db;create=true")

This config param is used by Hive, but apparently has to be added to the Hadoop configuration, which is used to build the Hive test context.

The trick is the timing: this has to be done after Hadoop and Hive have initialized themselves (from config files and whatnot), after the scalatest framework is initialized, and finally after the TestHive framework is initialized, but before any tests have run. Trying to set this parameter before these other initializations means your setting will be overwritten before your tests run.
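Putting the pieces together, the ordering looks like this (a sketch assuming scalatest's FunSuite/BeforeAndAfterAll and the SharedSparkContext trait from spark-testing-base, as in the answer above; the class name is illustrative):

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.sql.hive.test.TestHiveContext
import org.scalatest.FunSuite

class HiveMetastoreSpec extends FunSuite with SharedSparkContext {

  @transient private var hc: TestHiveContext = _

  override def beforeAll(): Unit = {
    super.beforeAll() // SharedSparkContext initializes sc here

    // Create the test Hive context first, so TestHive/HiveUtils
    // finish their own initialization...
    hc = new TestHiveContext(sc, false)

    // ...then override the metastore URL: after TestHive has run,
    // but before any test executes.
    sc.hadoopConfiguration.set("javax.jdo.option.ConnectionURL",
                               "jdbc:derby:memory:db;create=true")
  }
}
```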

Uncle Long Hair answered Sep 20 '22 20:09