Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex to match a string that does not contain 'xxx'

Tags:

regex

One of my homework questions asked to develop a regex for all strings over x,y,z that did not contain xxx

After doing some reading I found out about negative lookahead and made this which works great:

(x(?!xx)|y|z)*

Still, in the spirit of completeness, is there anyway to write this without negative lookahead?

Reading I have done makes me think it can be done with some combination of carets (^), but I cannot get the right combination so I am not sure.

Taking it a step further, is it possible to exclude a string like xxx using only the or (|) operator, but still check the strings in a recursive fashion?

EDIT 9/6/2010:

Think I answered my own question. I messed with this some more, trying make this regex with only or (|) statements and I am pretty sure I figured it out... and it isn't nearly as messy as I thought it would be. If someone else has time to verify this with a human eye I would appreciate it.

(xxy|xxz|xy|xz|y|z)*(xxy|xxz|xx|xy|xz|x|y|z)

like image 296
ubiquibacon Avatar asked Dec 06 '25 09:12

ubiquibacon


2 Answers

Try this:

^(x{0,2}(y|z|$))*$

The basic idea is this: for match at most 2 X's, followed by another letter or the end of the string.

When you reach a point where you have 3 X's, the regex has no rule that allows it to keep matching, and it fails.

Working example: http://rubular.com/r/ePH0fHlZxL

A less compact way to write the same is (with free spaces, usually the /x flag):

^(
y|         # y is ok
z|         # so is z
x(y|z|$)|  # a single x, not followed by x
xx(y|z|$)  # 2 x's, not followed by x
)*$

Based on the latest edit, here's an ever flatter version of the pattern: I'm not entirely sure I understand your fascination with the pipe, but you can eliminate some more options - by allowing an empty match on the second group you don't need to repeat permutations from the first group. That regex also allows ε, which I think is included in your language.

^(xxy|xxz|xy|xz|y|z)*(xx|x|)$
like image 170
Kobi Avatar answered Dec 08 '25 23:12

Kobi


Basically you have the right answer already - well done you. :)

Carat (^) in a set [^abc] will only match where it does not find a character in that set so it's application for matching orders of characters (i.e. strings) is limited and weak.

Regex has numeric quantifiers {n} and {a,b} which allow you to match a defined number of repititions of a pattern, which would work for this specific pattern (because it's 'x' repeated) but it's not particularily expressive of the problem you're trying to solve (even for regex!) and is a bit brittle (it wouldn't be appropriate for negative match 'xyx' for example.

An or pattern again would be verbose and rather unexpressive but it could be done as the fragment:

(x|xx)[^x] // x OR xx followed by NOT x

Obviously you can do this with an iterative algorithm but that's highly inefficient compared to a regex.

Well done for thinking beyond the solution though.

like image 35
annakata Avatar answered Dec 09 '25 01:12

annakata