Automated Testing in Apache Hive

I am about to embark on a project using Apache Hadoop/Hive which will involve a collection of Hive query scripts to produce data feeds for various downstream applications. These scripts seem like ideal candidates for unit testing: they represent the fulfillment of an API contract between my data store and client applications, and as such, it's trivial to write down what the expected results should be for a given set of starting data. My issue is how to run these tests.

If I were working with SQL queries, I could use something like SQLite or Derby to quickly bring up test databases, load test data and run a collection of query tests against them. Unfortunately, I am unaware of any such tools for Hive. At the moment, my best thought is to have the test framework bring up a local Hadoop instance and run Hive against that, but I've never done that before and I'm not sure it will work, or whether it's the right path.
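
To illustrate the pattern described above with plain SQL, here is a minimal sketch in Python using the standard-library `sqlite3` module. The `orders` table, its columns and the feed query are invented for this example; the point is only the shape of the test: bring up a throwaway database, load known starting data, run the query, and compare against the expected rows.

```python
import sqlite3

# Hypothetical feed query under test; table and column names are made up.
FEED_QUERY = """
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
"""


def run_feed_query(rows):
    """Load `rows` into a throwaway in-memory database and return the query result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        return conn.execute(FEED_QUERY).fetchall()
    finally:
        conn.close()


def test_totals_per_customer():
    # Known starting data and the rows the downstream client expects.
    starting_data = [(1, 10), (1, 5), (2, 7)]
    assert run_feed_query(starting_data) == [(1, 15), (2, 7)]
```

What I am looking for is the equivalent of `run_feed_query` for a Hive script.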

Also, I'm not interested in a pedantic discussion about whether what I am doing is unit testing or integration testing - I just need to be able to prove my code works.

Mark Tozzi asked Feb 23 '11 15:02




2 Answers

Hive has a special standalone mode, designed specifically for testing purposes. In this mode it can run without a running Hadoop cluster. I think it is exactly what you need. Here is a link to the documentation:

http://wiki.apache.org/hadoop/Hive/HiveServer
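
A minimal sketch of how a test harness might drive Hive this way from Python. It assumes the `hive` CLI is on the PATH and that the settings below are enough to keep everything local on your version; the property values, the `/tmp/hive-test` paths and the helper names are my assumptions, not something taken from the linked page.

```python
import subprocess

# Assumed local-mode settings; adjust or drop them for your Hive/Hadoop version.
LOCAL_MODE_CONF = {
    "mapreduce.framework.name": "local",  # run jobs in-process, no cluster
    "fs.defaultFS": "file:///",           # use the local filesystem
    "hive.metastore.warehouse.dir": "/tmp/hive-test/warehouse",
    "javax.jdo.option.ConnectionURL":
        "jdbc:derby:;databaseName=/tmp/hive-test/metastore_db;create=true",
}


def build_hive_command(script_path, conf=LOCAL_MODE_CONF):
    """Build the argument list for running one script with the Hive CLI."""
    cmd = ["hive", "-S"]  # -S: silent mode, so stdout carries only result rows
    for key, value in sorted(conf.items()):
        cmd += ["--hiveconf", f"{key}={value}"]
    cmd += ["-f", script_path]
    return cmd


def parse_rows(stdout):
    """Split the CLI's tab-separated output into a list of rows."""
    return [line.split("\t") for line in stdout.splitlines() if line]


def run_hive_script(script_path, runner=subprocess.run):
    """Run the script and return its rows; `runner` can be swapped out in tests."""
    result = runner(
        build_hive_command(script_path),
        capture_output=True, text=True, check=True,
    )
    return parse_rows(result.stdout)
```

A test would then load its fixture data with a setup script and assert on the returned rows, for example `assert run_hive_script("feed.hql") == [["1", "15"], ["2", "7"]]` for a hypothetical `feed.hql`. Note that every value comes back as a string.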

David Gruzman answered Sep 25 '22 04:09



I'm working as part of a team to support a big data and analytics platform, and we also have this kind of issue.

We've been searching for a while and we found two pretty promising tools:

https://github.com/klarna/HiveRunner

https://github.com/bobfreitas/HadoopMiniCluster

HiveRunner is a framework built on top of JUnit for testing Hive queries. It starts a standalone HiveServer with an in-memory HSQLDB as the metastore. With it you can stub tables and views, mock sample data, etc.

There are some limitations on which Hive versions it supports, but I definitely recommend it.

Hope it helps you =)

Julio Farah answered Sep 24 '22 04:09
