Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression for regular expressions? [duplicate]

Tags:

regex

php

Possible Duplicate:
Is there a regular expression to detect a valid regular expression?
Regular expression for finding a regular expression?

I have an application that enables the user to input a regular expression. How do I check against any input of regular expressions and make sure they are valid ones, because if they are not there there will be preg_match errors?

I don't want to use the '@' before preg_match, so if there's a way to check the validity of the user input of regular expressions that'd be great.

The regular expression system of PHP seems to be rather too complicated for me to come up with a regular expression for them.

like image 970
datasn.io Avatar asked May 07 '10 14:05

datasn.io


2 Answers

Mathematically, it is impossible to validate a regular expression using a regular expression. This is so because (formal) regular expressions can only recognize regular languages. A language is any set of strings. For example, the set of all decimal numbers is a language (which by the way can be described using a regular expression); the set of all valid regular expressions is also a language. Regular languages are languages that require only fixed finite memory (not a function of the input size) to be recognized.

The language that contains all valid regular expressions is not a regular language; hence it is impossible to recognize a regular expression using a regular expression.

To understand this, notice that regular expressions contain parentheses in them that must match. Hence, if an "(" has occurred, a ")" must occur later on. This is impossible to describe with a machine that has only fixed finite memory. For, if there were a way to do it, and your regular expression had a finite memory of K different states (for some integer K), an expression with K opening parentheses followed by K closing parentheses, though a valid regular expression would have been unable to be recognized by that machine -- a contradiction (notice that in formal languages, our assumption is that text processing occurs one character at a time, from left to right, which is the same for applied regular expressions). We call languages such as the one that describes regular expressions context-free and not regular.

(It is trivial to prove that regular expressions do not form a regular language using the Pumping Lemma)

So, there is a fundamental computer science problem in recognizing regular expressions using regular expressions: It is mathematically impossible to do so.

Regular languages are possible to be recognized by finite-state automata, i.e. machines with finite states but without memory. To overcome your problem, you need to add some memory which is dependant on the input size. Regular expressions, as they are context-free (fortunately they're not some obscure, hard-to-recognize type of language) can be recognized in linear time using a push-down automaton. This is a "for" loop that goes through the expression one token (usually a character) at a time and keeps track of what it's seen on a stack, i.e. it "pushes" data that it laters "pops" in a first-in-last-out fashion. (Example of data pushed to the stack: "I need to remember to find a matching `)' later on!"; you can "push" this as many times as you need; you can "pop" it later, when you need to check if you actually needed to have matched an opening parenthesis previously).

Of course, writing your own recognition engine for regular expressions would be a bit of an overhead -- but if you want to do it, you should know the above limitations. It would be more wise to employ an already existing mechanism to do it -- I suspect you could give that job to a regular expression library or a language that is more keen on handling regular expressions such as Perl; but the @-method doesn't sound like too bad of an idea after all: It may be slow, but your users may input terribly slow regular expressions anyway; and it may be a bad practice, but in your case it seems the best possible solution available.

Some related articles in Wikipedia:

  • Wikipedia: Regular_language
  • Wikipedia: Deterministic_finite_state_machine
  • Wikipedia: Regular_expression#Formal_language_theory
  • Wikipedia: Push-down_automaton
  • Wikipedia: Context-free_language
  • Wikipedia: Pumping_lemma_for_regular_languages
  • Wikipedia: LIFO

I hope this helped!

like image 107
dionyziz Avatar answered Oct 17 '22 05:10

dionyziz


preg_match() returns FALSE if an error occurred.

  1. send the expression to the server
  2. preg_match on an empty string
  3. see if an error occurs

You can either use Ajax to validate real time, or validate after form submit.
You can also try to validate by feeding the expression to javascript regexp engine, but js regexp syntax is not 100% compatible with the php one.

like image 44
clyfe Avatar answered Oct 17 '22 06:10

clyfe