
What is the proper way to set up and seed a database with artificial data for integration testing

Let's say I have 2 tables in a database, one called students and the other called departments. students looks like the following:

department_id, student_id, class, name, age, gender, rank

and departments looks like:

department_id, department_name, campus_id, number_of_faculty

I have an API that can query the database and retrieve various information from the 2 tables. For example, I have an endpoint that gets the number of students on each campus by joining the 2 tables.

I want to do integration testing for my API endpoints. To do that, I spin up a local database, run migrations of the database schema to create the tables, then populate each table with artificial records so that I know exactly what is in the database. But coming up with a good seeding process has proven to be anything but easy. For the simple example described above, my current approach involves generating multiple distinct records for each column. For example, I need at least 2 campuses (say main and satellite) and 3 departments (say Electrical Engineering and Mathematics on the main campus and English on the satellite campus). Then I need at least 2 students in each department, or 6 students in total. And if I mix in gender, age and rank, you can easily see that the number of artificial records grows exponentially. Coming up with all these artificial records is a manual process and thus tedious to maintain.
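(For concreteness, a sketch of what that hand-maintained seed data ends up looking like, using the column layouts above; the values are invented, and plain Python lists stand in for whatever fixture format the test suite actually uses.)

```python
# Hand-written seed records, mirroring the column order given above.
# departments: department_id, department_name, campus_id, number_of_faculty
departments = [
    (1, "Electrical Engineering", 1, 40),   # main campus
    (2, "Mathematics",            1, 25),   # main campus
    (3, "English",                2, 15),   # satellite campus
]

# students: department_id, student_id, class, name, age, gender, rank
students = [
    (1, 101, "2019", "Alice", 20, "F", "sophomore"),
    (1, 102, "2018", "Bob",   22, "M", "junior"),
    (2, 103, "2020", "Carol", 19, "F", "freshman"),
    (2, 104, "2018", "Dan",   23, "M", "senior"),
    (3, 105, "2019", "Eve",   21, "F", "sophomore"),
    (3, 106, "2020", "Frank", 18, "M", "freshman"),
]
# Every additional dimension (more ranks, ages, genders, ...) multiplies the
# number of rows that have to be written and maintained by hand.
```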

So my question is: what is the proper way to set up and seed database for integration testing in general?

asked Oct 30 '17 by breezymri



2 Answers

First, I do not know of any public tool that automates the task of generating test data for arbitrary scenarios.

Actually, this is a hard task in general. You might look for scientific papers and books on the topic; there are many of those. Unfortunately, I have no recommendation for a set of "good" ones.

A quite trivial approach is generating random data drawn from a set of potential values per field (per column, in the database case). This is what you did already. For smaller sets you may even generate the full set of potential combinations. E.g. you might have a look at an existing test data generator for an example applying a variant of such an approach.
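(A minimal sketch of that kind of exhaustive generation in Python; the column names come from the question, while the value pools are invented purely for illustration.)

```python
import itertools

# Potential values per column (made-up pools for illustration).
department_ids = [1, 2, 3]
classes = ["2018", "2019", "2020"]
ages = [18, 21, 65]
genders = ["F", "M"]
ranks = ["freshman", "sophomore", "junior", "senior"]

# Full cartesian product: every combination becomes one candidate record.
students = [
    {"student_id": i, "department_id": dep, "class": cls,
     "age": age, "gender": gender, "rank": rank}
    for i, (dep, cls, age, gender, rank) in enumerate(
        itertools.product(department_ids, classes, ages, genders, ranks),
        start=1,
    )
]
print(len(students))  # 3 * 3 * 3 * 2 * 4 = 216 candidate records
```

Even these small pools already produce hundreds of rows, which is exactly the redundancy problem listed first below.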

However, this might not be appropriate for the following reasons:

  • The resulting data will exhibit significant redundancy, while it may still not cover all interesting cases.
  • It might create inconsistent data with respect to logical constraints your application would enforce otherwise (e.g. referential integrity).

You might address such issues by adding constraints to the test data generation process that eliminate combinations which are invalid or redundant with respect to your application.
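(Such constraints can be expressed as simple predicates over the generated records, as in the sketch below; both rules are invented stand-ins for whatever your application actually enforces.)

```python
# Candidate records as produced by a combinatorial generator (abbreviated here).
candidates = [
    {"student_id": 1, "department_id": 1, "age": 18, "gender": "F", "rank": "senior"},
    {"student_id": 2, "department_id": 9, "age": 22, "gender": "M", "rank": "junior"},
    {"student_id": 3, "department_id": 2, "age": 25, "gender": "F", "rank": "senior"},
]
existing_department_ids = {1, 2, 3}

def is_valid(student):
    # Referential integrity: the student must point at a department that exists.
    if student["department_id"] not in existing_department_ids:
        return False
    # Invented domain rule: a "senior" younger than 21 makes no sense here.
    if student["rank"] == "senior" and student["age"] < 21:
        return False
    return True

seed_students = [s for s in candidates if is_valid(s)]
print([s["student_id"] for s in seed_students])  # [3]
```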

Which restrictions are possible (and make sense), however, depends on your business and use cases, so there is no general rule. E.g. if your API treats age values differently depending on gender, then combinations of age and gender are important for your tests; if no such distinction exists, any combination of age and gender will be OK.

As long as you are looking at white-box test scenarios, you will need to bring in your implementation (or at least specification) details.

For black-box testing, a full set of combinatorial data will be sufficient; the only remaining issue is reducing the test data so that test runtime stays within some maximum.

When dealing with white-box testing, you might explicitly add corner cases. E.g. in your case: a department without any students, a department with a single student, or students without a department, as long as such scenarios make sense for your testing purposes (e.g. when testing error handling, or how your application deals with inconsistent data).
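(Corner cases like these are usually easiest to append by hand on top of the generated bulk data; the records below are invented examples.)

```python
# Hand-picked corner cases layered on top of the generated records.
extra_departments = [
    # a department with no students at all
    {"department_id": 99, "department_name": "Philosophy",
     "campus_id": 2, "number_of_faculty": 3},
]
extra_students = [
    # a student whose department_id matches no department (inconsistent data);
    # include this only if the tests cover how the application handles it
    {"department_id": 12345, "student_id": 9001, "class": "2020",
     "name": "Orphan", "age": 20, "gender": "F", "rank": "freshman"},
]
```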

In your case you are looking at your API as the main view onto the data; the database content is just the input necessary for producing all interesting output from that API. The task of identifying proper database content can therefore be described as the mathematical problem of inverting the mapping your application provides (from database content to API results).

In the absence of any ready-made tool, you might apply the following steps (a minimal end-to-end sketch follows the list):

  1. start with a simple combinatorial data generator
  2. apply some restrictions eliminating useless or illegal records
  3. run tests, capturing coverage data
  4. add extra data records to improve coverage and repeat testing until coverage is OK
  5. review and adjust the data after any change to your code or schema
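(Putting the steps together: a minimal end-to-end sketch using Python's built-in unittest and an in-memory SQLite database as stand-ins for the real stack; since the actual API is not shown in the question, the endpoint is represented by the join query it would run, and the seed records are hand-written for brevity.)

```python
import sqlite3
import unittest

def seed(conn):
    """Create the two tables and load a known set of seed records."""
    conn.executescript("""
        CREATE TABLE departments (
            department_id INTEGER PRIMARY KEY, department_name TEXT,
            campus_id INTEGER, number_of_faculty INTEGER);
        CREATE TABLE students (
            department_id INTEGER, student_id INTEGER PRIMARY KEY,
            class TEXT, name TEXT, age INTEGER, gender TEXT, rank TEXT);
    """)
    conn.executemany("INSERT INTO departments VALUES (?, ?, ?, ?)", [
        (1, "Electrical Engineering", 1, 40),   # main campus
        (2, "English", 2, 15),                  # satellite campus
    ])
    conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?, ?, ?)", [
        (1, 101, "2019", "Alice", 20, "F", "sophomore"),
        (1, 102, "2018", "Bob", 22, "M", "junior"),
        (2, 103, "2020", "Carol", 19, "F", "freshman"),
    ])

class StudentsPerCampusTest(unittest.TestCase):
    def setUp(self):
        # A fresh, fully known database state before every test.
        self.conn = sqlite3.connect(":memory:")
        seed(self.conn)

    def test_students_per_campus(self):
        # The expected counts follow directly from the seed records above.
        rows = self.conn.execute("""
            SELECT d.campus_id, COUNT(*) FROM students s
            JOIN departments d ON s.department_id = d.department_id
            GROUP BY d.campus_id ORDER BY d.campus_id
        """).fetchall()
        self.assertEqual(rows, [(1, 2), (2, 1)])

if __name__ == "__main__":
    unittest.main()
```

Once the data generation in steps 1 and 2 is automated, the seeding helper becomes the single place that has to change when the schema or the coverage requirements change.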

answered Nov 03 '22 by rpy


I think DbUnit might be the right tool for what you're trying to do. You can specify the state of your database before the tests and check the expected state after.

answered Nov 03 '22 by Florian Wilhelm