I am trying to create a Python function that takes a plain-English description of a regular expression and returns the regular expression to the caller.
Currently I am thinking of writing the description in YAML. We can store the description in a raw string variable, pass it to this function, and then pass that function's output to the 're' module. Here is a rather simplistic example:
# a(b|c)d+e*
re1 = """
- literal: 'a'
- one_of: 'b,c'
- one_or_more_of: 'd'
- zero_or_more_of: 'e'
"""
myre = re.compile(getRegex(re1))
myre.search(...)
etc.
Does anyone think something like this would be of wider use? Do you know of any existing packages that do this? What limitations do you see in this approach? Does anyone think that having the declarative string in the code would make it more maintainable?
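To make the idea concrete, here is a minimal sketch of what getRegex could look like. It assumes the YAML has already been parsed into a list of single-key mappings (for example with a YAML loader), so the sketch itself needs only the standard library. The four keys handled are the ones from the example above; everything else about the function is an assumption for illustration, not an existing package.

```python
import re


def getRegex(steps):
    """Translate a list of single-key dicts into a regex string (hypothetical helper)."""
    parts = []
    for step in steps:
        # Each step is expected to hold exactly one key/value pair.
        (kind, value), = step.items()
        if kind == "literal":
            parts.append(re.escape(value))
        elif kind == "one_of":
            # 'b,c' becomes the alternation (?:b|c)
            alternatives = [re.escape(v.strip()) for v in value.split(",")]
            parts.append("(?:" + "|".join(alternatives) + ")")
        elif kind == "one_or_more_of":
            parts.append("(?:" + re.escape(value) + ")+")
        elif kind == "zero_or_more_of":
            parts.append("(?:" + re.escape(value) + ")*")
        else:
            raise ValueError(f"unknown step type: {kind!r}")
    return "".join(parts)


# The parsed form of the YAML description shown in the question.
description = [
    {"literal": "a"},
    {"one_of": "b,c"},
    {"one_or_more_of": "d"},
    {"zero_or_more_of": "e"},
]

pattern = getRegex(description)
myre = re.compile(pattern)
print(pattern)
print(myre.search("xxacddeeyy").group())
```

The non-capturing groups are slightly more verbose than a(b|c)d+e*, but they keep each step self-contained, so a multi-character value such as 'de' still gets the quantifier applied to the whole value.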
Regular expressions are particularly useful for defining filters. A regular expression is a series of characters that defines a pattern of text to be matched, which lets you make a filter more specialized or more general.
There are also two types of regular expressions: the "basic" regular expression and the "extended" regular expression. A few utilities, such as awk and egrep, use the extended form; most use the basic form. From now on, when I talk about a "regular expression," I mean a feature common to both types.
The Match-zero-or-more Operator (*)
This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. '*' represents this operator. For example, 'o*' matches any string made up of zero or more 'o's.
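A short illustration of this operator, using Python's 're' module as a stand-in (its syntax differs from POSIX basic regular expressions in other respects, but '*' behaves as described here). The variable names are only for illustration.

```python
import re

# 'o*' matches any string made up of zero or more 'o's, including the empty string.
zero_or_more_o = re.compile(r"o*")
matches_empty = zero_or_more_o.fullmatch("") is not None
matches_run = zero_or_more_o.fullmatch("ooo") is not None

# The star binds to the smallest preceding expression: in 'fo*' only the 'o' repeats.
fo_star = re.compile(r"fo*")
matches_f = fo_star.fullmatch("f") is not None
matches_fofo = fo_star.fullmatch("fofo") is not None

# Grouping is needed to repeat the whole of 'fo'.
grouped_matches_fofo = re.fullmatch(r"(?:fo)*", "fofo") is not None

print(matches_empty, matches_run, matches_f, matches_fofo, grouped_matches_fofo)
```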
A regular expression (sometimes called a rational expression) is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, such as "find and replace" operations.
This is actually pretty similar (identical?) to how a lexer/parser works. If you had a defined grammar then you could probably write a parser with not too much trouble. For instance, you could write something like this:
<expression> ::= <rule> | <rule> <expression> | <rule> " followed by " <expression>
<rule> ::= <val> | <qty> <val>
<qty> ::= "literal" | "one" | "one of" | "one or more of" | "zero or more of"
<val> ::= "a" | "b" | "c" | "d" | ... | "Z"
That's nowhere near a perfect description. For more info, take a look at this BNF of the regex language. You could then look at lexing and parsing the expression.
If you did it this way you could probably get a little closer to Natural Language/English versions of regexes.
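As a rough sketch of the lexing and parsing step, the grammar above can be handled with a small hand-written tokenizer and a loop over rules. Two assumptions are mine and not part of the grammar as written: values are wrapped in double quotes in the input so they cannot be confused with keywords, and "literal", "one" and "one of" all map to a plain single match. The names tokenize and english_to_regex are hypothetical.

```python
import re

# One token is a quantity phrase, the "followed by" separator, or a quoted character.
# Longer phrases come first so "one or more of" is not read as "one".
TOKEN = re.compile(
    r'\s*(?:(one or more of|zero or more of|one of|one|literal)|(followed by)|"(.)")'
)
SUFFIX = {
    "literal": "",
    "one": "",
    "one of": "",
    "one or more of": "+",
    "zero or more of": "*",
}


def tokenize(text):
    """Split the description into (kind, value) tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at position {pos}: {text[pos:]!r}")
        if m.group(1):
            tokens.append(("qty", m.group(1)))
        elif m.group(2):
            tokens.append(("sep", m.group(2)))
        else:
            tokens.append(("val", m.group(3)))
        pos = m.end()
    return tokens


def english_to_regex(text):
    """Parse <expression> as a sequence of <rule>s, optionally joined by 'followed by'."""
    tokens = tokenize(text)
    out, i = [], 0
    while i < len(tokens):
        # <rule> ::= <val> | <qty> <val>
        suffix = ""
        if tokens[i][0] == "qty":
            suffix = SUFFIX[tokens[i][1]]
            i += 1
        if i >= len(tokens) or tokens[i][0] != "val":
            raise ValueError("expected a quoted value")
        out.append(re.escape(tokens[i][1]) + suffix)
        i += 1
        # Optional " followed by " before the next rule.
        if i < len(tokens) and tokens[i][0] == "sep":
            i += 1
            if i >= len(tokens):
                raise ValueError("dangling 'followed by'")
    if not out:
        raise ValueError("empty description")
    return "".join(out)


result = english_to_regex(
    'literal "a" followed by one or more of "d" zero or more of "e"'
)
print(result)
```

A real tool would need alternation, grouping and character classes as well, which is where a proper parser generator or a parsing library starts to pay off.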
I can see a tool like this being useful, but as was previously said, mainly for beginners. The main limitation to this approach would be in the amount of code you have to write to translate the language into regex (and/or vice versa). On the other hand, I think a two-way translation tool would actually be more ideal and see more use. Being able to take a regex and turn it into English might be a lot more helpful to spot errors.
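For the regex-to-English direction, a deliberately tiny sketch might look like the following. It assumes a very restricted subset (plain characters, each optionally followed by '*', '+' or '?') and rejects everything else; the function name and the wording of the phrases are made up for illustration.

```python
QUANTIFIERS = {
    "*": "zero or more of",
    "+": "one or more of",
    "?": "zero or one of",
}
# Syntax this sketch does not try to explain.
UNSUPPORTED = set("\\.^$|()[]{}")


def regex_to_english(pattern):
    """Describe a simple regex (literal characters with *, + or ?) in English."""
    phrases = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in UNSUPPORTED or ch in QUANTIFIERS:
            raise ValueError(f"unsupported syntax at position {i}: {ch!r}")
        i += 1
        if i < len(pattern) and pattern[i] in QUANTIFIERS:
            phrases.append(f"{QUANTIFIERS[pattern[i]]} '{ch}'")
            i += 1
        else:
            phrases.append(f"literal '{ch}'")
    return " followed by ".join(phrases)


explained = regex_to_english("ad+e*")
print(explained)
```

Even this small subset shows how a misplaced quantifier would stand out once the pattern is spelled out in words.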
Of course, it doesn't take too long to pick up regex, as the syntax is usually terse and most of the meanings are pretty self-explanatory, at least if you use | or || as OR in your language and you think of * as multiplying by 0-N and + as adding 0-N.
Though sometimes I wouldn't mind typing "find one or more 'a' followed by three digits or 'b' then 'c'"
Please take a look at pyparsing. Many of the issues that you describe with RE's are the same ones that inspired me to write that package.
Here are some specific features of pyparsing from the O'Reilly e-book chapter "What's so special about pyparsing?".