Randomly generate a string that does NOT match a given regular expression

Tags:

python

regex

For testing purposes on a project I'm working on, I need to take a regular expression and randomly generate a string that will FAIL to be matched by it. For instance, if I'm given this regex:

^[abcd]d+

Then I should be able to generate strings such as:

hnbbad
uduebbaef
9f8;djfew
skjcc98332f

...each of which does NOT match the regex, but NOT generate:

addr32
bdfd09usdj
cdddddd-9fdssee

...each of which DOES. In other words, I want something like an anti-Xeger.

Does such a library exist, preferably in Python (if I can understand the theory, I can most likely convert it to Python if need be)? I gave some thought to how I could write this, but given the scope of regular expressions, it seemed that might be a much harder problem than what things like Xeger can tackle. I also looked around for a pre-made library to do this, but either I'm not using the right keywords to search or nobody's had this problem before.


asked Nov 09 '12 21:11

CaptainSpam

1 Answer

My initial instinct is, no, such a library does not exist because it's not possible. You can't be sure that you can find a valid input for any arbitrary regular expression in a reasonable amount of time.

For example, a regular expression with lookaheads and backreferences can encode an arithmetic condition such as primality. Matched from the start of a string made of a single repeated character, the following regular expression accepts the string only if it is at least 10000 characters long and its total length is a prime number:

(?!(..+)\1+$).{10000}

I doubt that any library exists that can find a valid input to this regular expression in reasonable time. And this is a very easy example with a simple solution, e.g. 'x' * 10007 will work. It would be possible to come up with other regular expressions that are much harder to find valid inputs for.
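The behaviour of this pattern can be checked with Python's re module. The sketch below is only an illustration: `prime_length_pattern` is a made-up helper that turns the 10000 into a parameter so the check runs instantly, and `re.match` is used because the pattern has no `^` anchor of its own.

```python
import re

def prime_length_pattern(min_len):
    # Hypothetical helper: same shape as the pattern above,
    # with the minimum length (10000 in the answer) made a parameter.
    return re.compile(r"(?!(..+)\1+$).{%d}" % min_len)

pat = prime_length_pattern(10)

# Try strings of a single repeated character, matched from position 0.
results = {n: bool(pat.match("x" * n)) for n in (9, 10, 11, 12, 13, 15, 17)}

for n, matched in results.items():
    print(n, matched)
```

Only the lengths that are both at least 10 and prime (11, 13 and 17 here) are accepted, which is the property a generator would have to discover on its own.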

I think the only way you are going to solve this is if you limit yourself to some subset of all possible regular expressions.

That said, if you have a magical library that generates text matching any arbitrary regular expression, then all you need to do is write a regular expression that matches exactly the strings that your original expression does not match.

Luckily this is possible using a negative lookahead:

^(?![\s\S]*(?:^[abcd]d+))
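As a rough sketch, the same wrapper can be built around any pattern string and checked against the strings from the question. `negate` is a hypothetical helper name, and the sketch assumes the wrapped pattern carries no inline global flags such as `(?i)`.

```python
import re

def negate(pattern):
    # Hypothetical helper: wrap the pattern in the negative lookahead above.
    # The result produces an empty match at position 0 exactly when
    # `pattern` cannot be found anywhere in the string.
    return re.compile(r"^(?![\s\S]*(?:" + pattern + r"))")

original = re.compile(r"^[abcd]d+")
negated = negate(r"^[abcd]d+")

non_matching = ["hnbbad", "uduebbaef", "9f8;djfew", "skjcc98332f"]
matching = ["addr32", "bdfd09usdj", "cdddddd-9fdssee"]

for s in non_matching + matching:
    # The two columns should always disagree.
    print(s, bool(original.search(s)), bool(negated.search(s)))
```

Note that the negated pattern only ever produces an empty match, so a generator driven by it still has to invent the rest of the string without violating the lookahead.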

If you are willing to change the requirements to allow only a limited subset of regular expressions, then you can negate the regular expression by using boolean logic. For example, ^[abcd]d+ becomes ^[^abcd]|^[abcd][^d]. (This leaves out the empty string and one-character strings such as "a", which also fail the original, but every string it matches is a valid non-match.) It is then possible to find a valid input for this regular expression in reasonable time.
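For this one example, the two branches of the negated expression can be turned into a small random generator by hand. This is a minimal sketch, not a general library: `random_non_match`, the alphabet, and the tail length are all assumptions chosen for illustration.

```python
import random
import re
import string

ORIGINAL = re.compile(r"^[abcd]d+")
NEGATED = re.compile(r"^[^abcd]|^[abcd][^d]")

# Assumed alphabet for the generated strings (lowercase letters and digits).
ALPHABET = string.ascii_lowercase + string.digits

def random_non_match(rng, max_tail=8):
    # Pick one of the two branches of the negated expression at random.
    if rng.random() < 0.5:
        # Branch ^[^abcd]: first character is anything except a, b, c, d.
        head = rng.choice([c for c in ALPHABET if c not in "abcd"])
    else:
        # Branch ^[abcd][^d]: one of a, b, c, d followed by anything except d.
        head = rng.choice("abcd") + rng.choice([c for c in ALPHABET if c != "d"])
    # The rest of the string is unconstrained.
    tail = "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, max_tail)))
    return head + tail

rng = random.Random(0)
samples = [random_non_match(rng) for _ in range(1000)]

for s in samples[:5]:
    print(s)
```

Every generated string satisfies the negated expression by construction, so none of them can be matched by the original.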


answered Sep 30 '22 17:09

Mark Byers
