
Can a regular expression be tested to see if it reduces to .*

I'm developing an application where users enter a regular expression as a filter criterion. However, I do not want people to be (easily) able to enter .* (i.e. match anything). The problem is that if I just check if (expression.equals(".*")), this can easily be sidestepped by entering something such as .*.*.

Does anyone know of a test that could take a piece of regex and see if it is essentially .* but in a slightly more elaborate form?

My thoughts are:

  1. I could check whether the expression is one or more repetitions of .* (i.e. whether it matches (\.\*)+; the quoting/escaping may not be entirely accurate, but you get the idea). The problem with this is that there may be other ways of writing a global match (e.g. with $ and ^) that are too numerous to even think of upfront, let alone test.

  2. I could test a few randomly generated strings against it and assume that if they all match, the user has entered a globally matching pattern. The problem with this approach is that the expression could be sufficiently tight and I just happen to pick strings that all match it.
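Idea 1 could be sketched roughly as follows. This is only an illustration: the class and method names are made up, and the pattern deliberately recognizes nothing beyond literal repetitions of .* with optional ^ and $ anchors, which is exactly the limitation described above.

```java
import java.util.regex.Pattern;

final class TrivialMatchAllCheck {
    // Hypothetical helper pattern: one or more literal ".*", optionally wrapped in ^ and $.
    private static final Pattern REPEATED_DOT_STAR = Pattern.compile("\\^?(\\.\\*)+\\$?");

    /** Returns true only for the most obvious spellings of "match anything". */
    static boolean isRepeatedDotStar(String userRegex) {
        return REPEATED_DOT_STAR.matcher(userRegex).matches();
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedDotStar(".*.*"));      // true
        System.out.println(isRepeatedDotStar("^.*$"));      // true
        System.out.println(isRepeatedDotStar("[\\s\\S]*")); // false: equivalent in effect, but not caught
    }
}
```

The last call shows why a purely textual check is easy to sidestep.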

Thoughts, anyone?

(FYI, the application is in Java but I guess this is more of an algorithmic question than one for a particular language.)

asked Nov 20 '11 20:11 by user1056788


2 Answers

Yes, there is a way. It involves converting the regex to a canonical FSM representation. See http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions

You can likely find published code that does the work for you. If not, the detailed steps are described here: http://swtch.com/~rsc/regexp/regexp1.html
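As a rough sketch of what such a check might look like, the code below builds a Thompson-style NFA from a tiny hand-made regex tree and then explores the corresponding DFA states (subset construction), rejecting as soon as it reaches a non-accepting state. Everything here is an assumption made for illustration: the Node mini-AST and its factory methods are hypothetical (a real tool would need to parse the user's regex into some such tree), "any" is taken to include newlines, and all characters not mentioned in the expression are represented by a single placeholder symbol.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class UniversalityCheck {
    /** Hypothetical mini-AST: 'c' = character class, '.' = concatenation, '|' = alternation, '*' = star. */
    static final class Node {
        final char kind; final Node a, b; final Set<Character> set; final boolean negated;
        Node(char kind, Node a, Node b, Set<Character> set, boolean negated) {
            this.kind = kind; this.a = a; this.b = b; this.set = set; this.negated = negated;
        }
        boolean accepts(char c) { return set.contains(c) != negated; }
    }
    static Node chars(boolean negated, char... cs) {
        Set<Character> set = new TreeSet<>();
        for (char c : cs) set.add(c);
        return new Node('c', null, null, set, negated);
    }
    static Node any() { return chars(true); } // negated empty class: every character
    static Node cat(Node a, Node b) { return new Node('.', a, b, null, false); }
    static Node alt(Node a, Node b) { return new Node('|', a, b, null, false); }
    static Node star(Node a) { return new Node('*', a, null, null, false); }

    private static final char OTHER = '\uFFFF'; // stands for all unmentioned characters (assumed unused)

    private final List<List<Integer>> eps = new ArrayList<>(); // epsilon edges per state
    private final List<Node> label = new ArrayList<>();        // class on the state's char edge, or null
    private final List<Integer> target = new ArrayList<>();    // destination of that char edge
    private final Set<Character> mentioned = new TreeSet<>();

    private int newState() {
        eps.add(new ArrayList<>()); label.add(null); target.add(-1);
        return eps.size() - 1;
    }

    /** Thompson construction; returns {start, accept} of the fragment. */
    private int[] build(Node n) {
        int s = newState(), t = newState();
        if (n.kind == 'c') {
            label.set(s, n); target.set(s, t); mentioned.addAll(n.set);
        } else if (n.kind == '*') {
            int[] x = build(n.a);
            eps.get(s).add(x[0]); eps.get(s).add(t); eps.get(x[1]).add(x[0]); eps.get(x[1]).add(t);
        } else {
            int[] x = build(n.a), y = build(n.b);
            eps.get(s).add(x[0]); eps.get(y[1]).add(t);
            if (n.kind == '.') { eps.get(x[1]).add(y[0]); }
            else { eps.get(s).add(y[0]); eps.get(x[1]).add(t); }
        }
        return new int[] {s, t};
    }

    private Set<Integer> closure(Set<Integer> states) {
        Set<Integer> seen = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            for (int next : eps.get(work.pop())) {
                if (seen.add(next)) work.push(next);
            }
        }
        return seen;
    }

    /** True if the expression accepts every string, i.e. every reachable DFA state is accepting. */
    static boolean matchesEverything(Node regex) {
        UniversalityCheck u = new UniversalityCheck();
        int[] frag = u.build(regex);
        List<Character> alphabet = new ArrayList<>(u.mentioned);
        alphabet.add(OTHER);
        Set<Set<Integer>> visited = new HashSet<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = u.closure(Set.of(frag[0]));
        visited.add(start); work.push(start);
        while (!work.isEmpty()) {
            Set<Integer> cur = work.pop();
            if (!cur.contains(frag[1])) return false; // some string ends in a rejecting state
            for (char c : alphabet) {
                Set<Integer> next = new TreeSet<>();
                for (int s : cur) {
                    Node l = u.label.get(s);
                    if (l != null && l.accepts(c)) next.add(u.target.get(s));
                }
                next = u.closure(next);
                if (visited.add(next)) work.push(next);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matchesEverything(star(any())));                                 // true
        System.out.println(matchesEverything(star(alt(chars(false, 'a'), chars(true, 'a'))))); // true
        System.out.println(matchesEverything(star(chars(false, 'a'))));                     // false
    }
}
```

Unlike a textual comparison, this treats (a|[^a])* and .* alike, because it looks at the language accepted and not at how the expression is spelled. It does not cover backreferences, lookaround, or other non-regular features of java.util.regex.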

If that seems like too much work, then you can use a quick and dirty probabilistic test. Just generate some random strings and see if they match the user's regex. If they all match, you have a pretty good indication that the regex is overly broad.
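A minimal sketch of such a probabilistic test in Java might look like this. The class name, the sample count, the string length bound, and the choice of printable ASCII as the character pool are all arbitrary assumptions; it also assumes the filter uses whole-string matching (matches() and not find()).

```java
import java.util.Random;
import java.util.regex.Pattern;

final class RandomProbe {
    /**
     * Returns true if the regex matches every one of the randomly generated samples.
     * Throws PatternSyntaxException if userRegex is not a valid pattern.
     */
    static boolean matchesAllSamples(String userRegex, int samples, long seed) {
        Pattern pattern = Pattern.compile(userRegex);
        Random random = new Random(seed);
        for (int i = 0; i < samples; i++) {
            int length = random.nextInt(20); // 0..19 characters
            StringBuilder sample = new StringBuilder(length);
            for (int j = 0; j < length; j++) {
                sample.append((char) (32 + random.nextInt(95))); // printable ASCII, 32..126
            }
            if (!pattern.matcher(sample).matches()) {
                return false; // found a string the regex rejects, so it is not match-all
            }
        }
        return true; // only an indication, not a proof
    }

    public static void main(String[] args) {
        System.out.println(matchesAllSamples(".*.*", 200, 42L));
        System.out.println(matchesAllSamples("\\d+", 200, 42L));
        System.out.println(matchesAllSamples("[ -~]*", 200, 42L));
    }
}
```

A "true" result is only as good as the sample pool. For instance, [ -~]* passes here because every sample is printable ASCII, even though it rejects newlines and non-ASCII text, so the pool should resemble the data the filter will actually see.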

answered Oct 27 '22 15:10 by Raymond Hettinger


There are many, many ways to write something equivalent to .*. For example, just put any class of characters and its complement into a character class or an alternation, and it will match anything.
So I think it's not possible to use a regular expression to test another regular expression for equivalence to .*.

These are some examples that match the same as .* (they additionally match newline characters):

/[\s\S]*/
/(\w|\W)*/
/(a|[^a])*/
/(a|b|[^ab])*/
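To illustrate in Java (the language the question mentions), the small sketch below runs these four patterns against a string containing a newline. The class and method names are made up for the example; Pattern.matches and Pattern.DOTALL are standard java.util.regex.

```java
import java.util.regex.Pattern;

final class MatchAllVariants {
    // The four patterns listed above, written as Java string literals.
    static final String[] VARIANTS = {"[\\s\\S]*", "(\\w|\\W)*", "(a|[^a])*", "(a|b|[^ab])*"};

    /** Returns true if every variant matches the whole input. */
    static boolean allVariantsMatch(String input) {
        for (String variant : VARIANTS) {
            if (!Pattern.matches(variant, input)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "hello\nworld";
        System.out.println(allVariantsMatch(text));          // true
        System.out.println(Pattern.matches(".*", text));     // false: '.' skips line terminators by default
        System.out.println(Pattern.compile(".*", Pattern.DOTALL).matcher(text).matches()); // true
    }
}
```

So in Java these variants are, if anything, broader than a plain .* unless DOTALL is enabled.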

So I assume your idea 2 would be a lot easier to achieve.

answered Oct 27 '22 17:10 by stema