
Generating sample data from regex to verify input strings by focussing on boundary cases defined in regex

Several tools can generate sample data for a given regex, including:

  • REX
  • Fare

However, while they may be sufficient to seed a dataset, they don't help much when testing code that depends on the regex itself, such as validation.

Assume you have a code generator that generates a model with a property. The user specifies a regex to validate the property. Assume now that the code generator is attempting to generate tests to ensure that validation succeeds and fails appropriately. It seems reasonable for the tool to focus on boundary cases within the regex to avoid generating unnecessary data.

For example, for the regex ^([a-z]{3,6})$ the boundary cases include:

  • any string consisting only of [a-z] with a length equal to 2 (failure)
  • any string consisting only of [a-z] with a length equal to 3 (success)
  • any string consisting only of [a-z] with a length equal to 4 (success)
  • any string consisting only of [a-z] with a length equal to 5 (success)
  • any string consisting only of [a-z] with a length equal to 6 (success)
  • any string consisting only of [a-z] with a length equal to 7 (failure)
  • any string containing no characters from [a-z] (failure)
  • any string not starting with [a-z] but ending with [a-z] (failure)
  • any string starting with [a-z] but not ending with [a-z] (failure)
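
The list above can be sketched in Python as a small table of inputs paired with the expected validation result. This is only an illustration for this one regex: the helper names boundary_cases and check are hypothetical, the bounds 3 and 6 are copied by hand from the quantifier rather than derived from the pattern, and the digit "1" is an arbitrary choice of a character outside [a-z].

```python
import re
import string

PATTERN = re.compile(r"^([a-z]{3,6})$")


def boundary_cases(lo=3, hi=6, alphabet=string.ascii_lowercase):
    """Return (input, should_match) pairs around the {lo,hi} length bounds."""

    def word(n):
        # Cycle through the alphabet so the string has exactly n characters.
        return "".join(alphabet[i % len(alphabet)] for i in range(n))

    # Lengths lo-1 .. hi+1: one failure on each side, successes in between.
    cases = [(word(n), lo <= n <= hi) for n in range(lo - 1, hi + 2)]
    cases += [
        ("1234", False),          # no character from [a-z]
        ("1" + word(lo), False),  # does not start with [a-z] but ends with it
        (word(lo) + "1", False),  # starts with [a-z] but does not end with it
    ]
    return cases


def check(pattern=PATTERN):
    """Return (input, expected, actual) triples for every boundary case."""
    return [
        (s, expected, pattern.fullmatch(s) is not None)
        for s, expected in boundary_cases()
    ]


if __name__ == "__main__":
    for s, expected, actual in check():
        status = "ok" if expected == actual else "MISMATCH"
        print(f"{s!r:12} expected={expected!s:5} actual={actual!s:5} {status}")
```

A generated test suite could then assert that expected and actual agree for every row. A real tool would have to derive lo, hi and the character class from the regex itself, which is the hard part the question is asking about.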

The reason for focussing on boundary cases is that any string consisting only of [a-z] with a length greater than 6 verifies the upper boundary of the string length defined in the regex. So testing strings of length 7, 8 and 9 is really just testing the same (boundary) condition.

This was an arbitrary regex chosen for its simplicity, but any reasonable regex may act as an input.

Does a framework or tool exist that the code generator can use to generate input strings for test cases covering the different layers of the systems being generated? The test cases come into their own when the system is no longer generated and is modified later in the development cycle.

bloudraak asked Jun 30 '12 21:06



1 Answer

If I understand your question correctly, you want to generate input for the system based on the validation regex so that you can automate unit testing.

Doesn't this defeat the purpose of unit testing, though? If someone changes the regex, wouldn't you want the validation to fail?

In any case, the simple answer is that generating a string from a regex is all but impossible. If it could be done, it would be extremely complex. For example, consider this regex:

(?<=\G\d{0,3})(?>[a-z]+)(?<=(?<foo>foo)|)(?(foo)(?!))

It is very simple for me to think of a string that would match (and/or generate matches):

abc123def456ghi789jkl123foo456pqr789stu123vwx456yz

The matches would be:

  • "abc"
  • "def"
  • "ghi"
  • "jkl"

But how would you generate a string from the expression? There is no clear starting point - it takes some extreme (for a computer) intelligence plus a dash of creativity to work out a solution. Something simple for a human, but very, very hard for a computer. Even if you could come up with a computer algorithm that would generate a matching string, it could easily look something like this:

a

This would generate a match, but it does a poor job of exercising the regex. The \d{0,3} is never really tried, and \G is only ever used to match the beginning of the input (rather than the end of the last match). (?<=(?<foo>foo)) is never tested (and if it were, it would result in a non-match).

It would also be easy to generate a string that does not match:

1

But, again, this doesn't really put the regex through its paces.

I don't know computer theory well enough to prove it, but I suspect this is related to the P versus NP question. It is relatively easy to write a regex that matches a given collection of complex strings, but difficult to generate a collection of complex strings that match a given regex.
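
As a rough illustration of the "easy" direction only, here is a minimal Python sketch that builds a regex from a collection of strings by escaping each one and joining them into an alternation. The function name regex_from_samples is hypothetical, and this is just the most trivial construction, not necessarily what was meant above; it says nothing about how hard the reverse direction is.

```python
import re


def regex_from_samples(samples):
    """Build a pattern that matches exactly the given strings (as whole inputs)."""
    # re.escape neutralises metacharacters so each sample is matched literally.
    alternatives = "|".join(re.escape(s) for s in samples)
    return re.compile("(?:" + alternatives + ")")


samples = ["abc", "def", "ghi", "jkl"]
pattern = regex_from_samples(samples)

if __name__ == "__main__":
    for candidate in samples + ["foo"]:
        print(candidate, pattern.fullmatch(candidate) is not None)
```

Every sample matches by construction, and anything outside the list is rejected when checked with fullmatch.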

JDB answered Oct 04 '22 17:10
