Random string that matches a regexp [duplicate]

1 Answer

Welp, just musing, but the general question of generating random inputs that match a regex sounds doable to me for a sufficiently relaxed definition of random and a sufficiently tight definition of regex. I'm thinking of the classical formal definition, which allows only parentheses, |, *, and alphabet characters, with concatenation written by placing items next to each other.

Regular expressions can be mapped to formal machines called finite automata. Such a machine is a directed graph with one node called the initial state, one node called the final state (some constructions allow several final states), and a letter from the alphabet on each edge. A word is matched by the regex if it's possible to start at the initial state, follow one edge per character of the word, in order, each edge labeled with that character, and end at the final state.
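To make the graph picture concrete, here is a minimal sketch of the acceptance check. The automaton is hand-built for the regex a(b|c)*d; the edge list, the state numbers, and the function name `accepts` are all assumptions made for illustration, not the output of any particular construction.

```python
# Hypothetical hand-built automaton for the regex a(b|c)*d.
# Each edge is (source state, letter, destination state).
EDGES = [(0, "a", 1), (1, "b", 1), (1, "c", 1), (1, "d", 2)]
INITIAL, FINAL = 0, 2


def accepts(edges, initial, final, word):
    """Return True if some path from initial to final spells out word."""
    # Track every state we could be in, so nondeterministic graphs work too.
    current = {initial}
    for ch in word:
        current = {dst for src, label, dst in edges
                   if src in current and label == ch}
    return final in current


print(accepts(EDGES, INITIAL, FINAL, "abcbd"))
print(accepts(EDGES, INITIAL, FINAL, "abc"))
```

Tracking a set of states rather than a single state is a design choice: it costs little and means the same function still works if two edges leaving a node carry the same letter.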

One could build the graph, then start at the final state and traverse random edges backwards, keeping track of the path. In a standard construction, every node in the graph is reachable from the initial state, so wherever the backward walk lands there is still some way back to the initial state, and you do not need to worry about making irrecoverable mistakes and needing to backtrack. If you reach the initial state, stop, and read off the edge labels along the path going forward. That's your match for the regex.
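The backward walk could be sketched roughly like this. Again the automaton is a hand-built stand-in for a(b|c)*d, and `random_match`, the edge list, and the uniform choice among incoming edges are my own assumptions; the description above does not say how the random edge should be picked.

```python
import random
import re

# Hypothetical hand-built automaton for the regex a(b|c)*d.
EDGES = [(0, "a", 1), (1, "b", 1), (1, "c", 1), (1, "d", 2)]
INITIAL, FINAL = 0, 2


def random_match(edges, initial, final, rng=random):
    """Walk backwards from final to initial, picking incoming edges at random."""
    # Index the edges by destination so we can look up incoming edges quickly.
    incoming = {}
    for src, label, dst in edges:
        incoming.setdefault(dst, []).append((src, label))

    state, labels = final, []
    while state != initial:
        src, label = rng.choice(incoming[state])  # assumption: uniform choice
        labels.append(label)
        state = src
    # Labels were collected back to front, so reverse them to read forward.
    return "".join(reversed(labels))


sample = random_match(EDGES, INITIAL, FINAL)
print(sample, bool(re.fullmatch(r"a(b|c)*d", sample)))
```

Note one consequence of stopping the first time the initial state is reached: if the initial state has incoming edges, matches whose path passes through it more than once are never produced.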

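There's no particular guarantee about how long it will take to reach the initial state, though. As far as I can tell, if every backward edge is picked with positive probability the walk gets there eventually with probability 1, but the path can be arbitrarily long. One would also have to figure out in what sense the generated strings are 'random', and in what sense you are hoping for a random element from the language in the first place.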

Maybe that's a starting point for thinking about the problem, though!

Now that I've written that out, it seems to me that it might be simpler to repeatedly resolve choices to simplify the regex pattern until you're left with a plain string. Find the first non-alphabet character in the pattern. If it's a *, replicate the preceding item some number of times (possibly zero) and remove the *. If it's a |, choose which of the OR'd items to preserve and remove the rest. For a left paren, do the same, but looking at the character following the matching right paren. This is probably easier if you first parse the regex into a tree representation, which makes the paren grouping structure easier to work with.
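Here is a minimal sketch of the tree version, assuming patterns that use only letters, parentheses, | and *. The tuple-based tree, the names `parse` and `generate`, and the `stop_probability` parameter (a coin flip that decides when a * stops repeating) are all assumptions of mine, one of many ways to resolve the choices.

```python
import random
import re


def parse(pattern):
    """Parse a regex made of letters, ( ) | and * into a nested tuple tree."""
    pos = 0

    def alternation():
        nonlocal pos
        branches = [concatenation()]
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            branches.append(concatenation())
        return ("alt", branches)

    def concatenation():
        items = []
        while pos < len(pattern) and pattern[pos] not in "|)":
            items.append(starred())
        return ("cat", items)

    def starred():
        nonlocal pos
        if pattern[pos] == "*":
            raise ValueError("nothing to repeat at position %d" % pos)
        if pattern[pos] == "(":
            pos += 1
            node = alternation()
            if pos >= len(pattern) or pattern[pos] != ")":
                raise ValueError("unbalanced parenthesis")
            pos += 1
        else:
            node = ("lit", pattern[pos])
            pos += 1
        while pos < len(pattern) and pattern[pos] == "*":
            node = ("star", node)
            pos += 1
        return node

    tree = alternation()
    if pos != len(pattern):
        raise ValueError("unexpected character at position %d" % pos)
    return tree


def generate(node, rng=random, stop_probability=0.5):
    """Resolve every choice in the tree at random and return a plain string."""
    kind, payload = node
    if kind == "lit":
        return payload
    if kind == "cat":
        return "".join(generate(item, rng, stop_probability) for item in payload)
    if kind == "alt":
        # Keep one of the OR'd branches and drop the rest.
        return generate(rng.choice(payload), rng, stop_probability)
    # kind == "star": repeat the item zero or more times.
    pieces = []
    while rng.random() >= stop_probability:
        pieces.append(generate(payload, rng, stop_probability))
    return "".join(pieces)


PATTERN = "a(b|cd)*e"
TREE = parse(PATTERN)
samples = [generate(TREE) for _ in range(5)]
print(samples, all(re.fullmatch(PATTERN, s) for s in samples))
```

With this scheme the number of repetitions of each * follows a geometric distribution, so short strings are much more likely than long ones. That is one answer to "random in what sense", not the only one.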

To the person who worried that deciding if a regex actually matches anything is equivalent to the halting problem: Nope, regular languages are quite well behaved. Checking whether anything matches at all just means checking whether the final state is reachable from the initial state. You can even tell if any two regexes describe the same set of accepted strings. You basically make the machine above, convert it to a deterministic one, then follow an algorithm to produce a canonical minimal equivalent machine. Do that for two regexes, then check if the resulting minimal machines are the same up to renaming of states, which is straightforward.
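As a rough illustration that equivalence really is decidable, here is a sketch that compares two deterministic machines. To keep it short it does not minimize anything; it swaps in a different but standard technique, walking both machines in lockstep over every reachable pair of states and checking that they always agree on acceptance. The machines are hand-built and complete (every state has an edge for every letter), and the names `equivalent`, `ENDS_IN_A`, `ENDS_IN_A_REDUNDANT`, and `STARTS_WITH_A` are hypothetical.

```python
from collections import deque

# Each machine is (start state, set of accepting states, transition table).
# All three are hand-built over the alphabet "ab" for illustration.
ENDS_IN_A = (0, {1}, {0: {"a": 1, "b": 0}, 1: {"a": 1, "b": 0}})
ENDS_IN_A_REDUNDANT = ("x", {"y", "z"}, {
    "x": {"a": "y", "b": "x"},
    "y": {"a": "z", "b": "x"},
    "z": {"a": "y", "b": "x"},
})
STARTS_WITH_A = ("s", {"ok"}, {
    "s": {"a": "ok", "b": "dead"},
    "ok": {"a": "ok", "b": "ok"},
    "dead": {"a": "dead", "b": "dead"},
})


def equivalent(machine_a, machine_b, alphabet):
    """Return True if both deterministic machines accept the same strings."""
    start_a, accept_a, delta_a = machine_a
    start_b, accept_b, delta_b = machine_b
    seen = {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        a, b = queue.popleft()
        # Some string leads here in both machines; they must agree on it.
        if (a in accept_a) != (b in accept_b):
            return False
        for symbol in alphabet:
            pair = (delta_a[a][symbol], delta_b[b][symbol])
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True


print(equivalent(ENDS_IN_A, ENDS_IN_A_REDUNDANT, "ab"))
print(equivalent(ENDS_IN_A, STARTS_WITH_A, "ab"))
```

The loop visits at most one pair per combination of states, so it always terminates, which is the point: nothing halting-problem-like is going on.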

186

answered Sep 21 '22 05:09

Ken


Tags:

language-agnostic

regex

random

Alvaro Rodriguez
