Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to test (unit test) on ETL process?

I know several small companies do not do testing on ETL process, but that seems to be suboptimal from the perspective of software engineering.

How do people usually do testing/unit test/functional test on ETL process?

like image 307
Hello lad Avatar asked Jun 14 '16 10:06

Hello lad


People also ask

What is unit testing in ETL testing?

Unit testing is a software development process in which the smallest testable parts of an application, called units, are individually and independently scrutinized for proper operation. This testing methodology is done during the development process by the software developers and sometimes QA staff.

How do you write test cases for ETL testing?

To design an ETL test case, you should know what ETL is and understand the data transformation requirement. ETL is short for Extract, Transform and Load. The objective of ETL testing is to assure that the data that has been loaded from a source to a destination after business transformation is accurate.

What is unit testing in data warehouse?

Testing is very important for data warehouse systems for data validation and to make them work correctly and efficiently. There are three basic levels of testing performed on data warehouse which are as follows : Unit Testing – This type of testing is being performed at the developer's end.

How do you perform a unit test?

A typical unit test contains 3 phases: First, it initializes a small piece of an application it wants to test (also known as the system under test, or SUT), then it applies some stimulus to the system under test (usually by calling a method on it), and finally, it observes the resulting behavior.


2 Answers

We recently worked on a project where the governance board demanded 'You must have Unit Tests' and so we tried our best.

What worked for us was have each ETL solution start and end with a QA/Test package.

Anything unexpected discovered by these packages was logged into an audit table and a Fail Package event was then raised to stop the entire Job - We figured it was better to run with yesterdays good data than risk reporting against possible bad 'today' data.

The starting package would do db schema and data sanity checks. Data Sanity involved checking for duplicate or missing data caused by a lack of Referential Integrity in the source systems. Schema checks ensured that any schema changes that did not get applied during Continuous integration were detected.

The end package would check the results of any transformations. These included:

  • Comparing record counts between source|destination
  • Checking specific transforms (eg: all date values changed to appropriate SK value, all string values RTrimed)
  • Ensuring all SK fields were populated (-1 instead of nulls)

Most of these tests were SQL statements the used the built in schema objects of our database, so they were not to onerous to create.

In addition, as part of our development process we would create views that had the end result of any transformations we were doing. We would make use of these views to validate our package transformations.

Each of these checks created a record in our special audit table. That way we could provide a comprehensive list of all the tests and checks we had done each running of the process to satisfy the governance peoples.

(We also had a separate set of packages that would unit test each QA test by means of creating dummy tables, populating them, running the test then confirming the appropriate audit record was written. As Nick stated, this was a lot of work and of little real value)

like image 72
Joe Avatar answered Sep 18 '22 12:09

Joe


testing of an ETL is usually a problem. More precisely, testing isnt problem, problem is how to get reasonable test data. ETL is typically tested on production data. Aside of the security issue, the problem with production data is that does not cover functionality of ETL sufficiently (typically about 40% of business rules isnt covered by production data sample) and it takes too much of time to process.

Recently we have developed a test data generator (for more detail, please look for GTL QAceGen: Business Logic Driven Data Generator on Informatica Market Place) which generate test data based into source tables/files on business rule specification. Tool takes into consideration any foreign key applied and it works for any major ETL and/databases.

This tool helps to speed up testing cycle by at least 50% (compared to manual testing) an covers 100% of all business rules. It also generates quite detailed reports and more importantly, these tests can be repeated at any time (ie regression tests).

like image 45
Pavel Kochan Avatar answered Sep 20 '22 12:09

Pavel Kochan