Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Randomly generate a string that does NOT match a given regular expression

Tags:

python

regex

For testing purposes on a project I'm working on, I have a need to, if given a regular expression, randomly generate a string that will FAIL to be matched by it. For instance, if I'm given this regex:

^[abcd]d+

Then I should be able to generate strings such as:

hnbbad
uduebbaef
9f8;djfew
skjcc98332f

...each of which does NOT match the regex, but NOT generate:

addr32
bdfd09usdj
cdddddd-9fdssee

...each of which DO. In other words, I want something like an anti-Xeger.

Does such a library exist, preferably in Python (if I can understand the theory, I can most likely convert it to Python if need be)? I gave some thought to how I could write this, but given the scope of regular expressions, it seemed that might be a much harder problem than what things like Xeger can tackle. I also looked around for a pre-made library to do this, but either I'm not using the right keywords to search or nobody's had this problem before.

like image 206
CaptainSpam Avatar asked Nov 09 '12 21:11

CaptainSpam


People also ask

How do you exclude a regular expression?

To match any character except a list of excluded characters, put the excluded charaters between [^ and ] . The caret ^ must immediately follow the [ or else it stands for just itself.

What does \+ mean in regex?

To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" .

How do you create a string in regex?

Just enter your regexp in the field below, press the Generate Text button, and you'll get a random text that matches your regexp. Press a button – invert a regexp. No ads, nonsense, or garbage.

What can be matched using (*) in a regular expression?

A regular expression followed by an asterisk ( * ) matches zero or more occurrences of the regular expression. If there is any choice, the first matching string in a line is used.


1 Answers

My initial instinct is, no, such a library does not exist because it's not possible. You can't be sure that you can find a valid input for any arbitrary regular expression in a reasonable amount of time.

For example, proving whether a number is prime is believed to be a hard to solve mathematical problem. The following regular expression matches any string which is at least 10000 characters long and whose total length is a prime number:

(?!(..+)\1+$).{10000}

I doubt that any library exists that can find a valid input to this regular expression in reasonable time. And this is a very easy example with a simple solution, e.g. 'x' * 10007 will work. It would be possible to come up with other regular expressions that are much harder to find valid inputs for.

I think the only way you are going to solve this is if you limit yourself to some subset of all possible regular expressions.


But having said that if you have a magical library that generates text that matches for any arbitrary regular expression then all you need to do is generate a regular expression that matches all the strings that don't match your original expression.

Luckily this is possible using a negative lookahead:

^(?![\s\S]*(?:^[abcd]d+))

If you are willing to change the requirements to only allow a limited subset of regular expressions then you can negate the regular expression by using boolean logic. For example if ^[abcd]d+ becomes ^[^abcd]|^[abcd][^d]. It is then possible to find a valid input for this regular expression in reasonable time.

like image 147
Mark Byers Avatar answered Sep 30 '22 17:09

Mark Byers