
With regex, is it possible to use both "is" and "is not" definitions within the same range?

Tags:

java

regex

Note: I'm using a third-party app that uses regex for searches. It has its own flavor, but it almost always behaves like Java's flavor of regex. Of course, this may not matter.

I searched for this question phrased many different ways, but I did not find any tutorials, examples, or even mentions of whether it is possible to use both an "is" (positive?) and an "is not" (negative?) definition within the same range.

I can't test the examples in the app right now, because the amount of data being searched is massive and a test would disturb the matches it has already gathered. That is the only reason I'm asking.

Here are examples of what I thought might work, but which caused the regex testers to behave strangely:

[\w^\s<>.!?]{2}
[\w|^\s<>.!?]{2}

I would rather have it work the way I think the first one would work (any digit, lowercase or uppercase letter, or other normal character that is not a space, >, <, period, !, or ?) than the second, which only has an "or" operator.

The regex testers I used gave me different, inconsistent results, which is what confuses me.

Also note: I'm using this within a capture group that is followed by a catch-everything match, which I may or may not be using properly. If you'd like to explain how to do that properly as well, feel free. I AM MAINLY JUST CURIOUS WHETHER THIS IS POSSIBLE OR NOT, OR WHETHER IT IS AN IMPROPER METHOD.

Travis Crum asked Oct 08 '12 15:10



2 Answers

Why do you need the \w at all?

[^\s<>.!?]{2}

This already matches all alphanumeric characters since they are neither space nor any of the punctuation characters you mentioned.
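To make the difference concrete, here is a minimal Java sketch (the class and method names are made up for illustration, and it assumes the app really does behave like java.util.regex). It compares the negated class above with the first attempt from the question. In Java, a ^ that is not the first character of a class is a literal caret, so the attempted class is simply the union of \w, ^, \s and the listed punctuation, which may explain the odd tester results.

```java
import java.util.regex.Pattern;

final class NegatedClassDemo {
    // The pattern from this answer: two characters, neither of which is
    // whitespace or one of < > . ! ?
    static final Pattern NEGATED = Pattern.compile("[^\\s<>.!?]{2}");

    // The first attempt from the question. The '^' is not the first character
    // in the class, so it is a literal caret; the class accepts word
    // characters, '^', whitespace, '<', '>', '.', '!' and '?'.
    static final Pattern ATTEMPT = Pattern.compile("[\\w^\\s<>.!?]{2}");

    static boolean matchesNegated(String s) {
        return NEGATED.matcher(s).matches();
    }

    static boolean matchesAttempt(String s) {
        return ATTEMPT.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Print how each pattern treats a few two-character samples.
        for (String s : new String[] {"ab", "a ", "a?", "^<", "-#"}) {
            System.out.println("\"" + s + "\" negated=" + matchesNegated(s)
                    + " attempt=" + matchesAttempt(s));
        }
    }
}
```

With these samples, the negated class accepts "ab" and "-#" but rejects anything containing a space or the listed punctuation, while the attempted class accepts exactly the characters it was meant to exclude.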

In general, you can subtract character classes to some degree. For example, to match alphanumerics excluding digits, you can use

[^\W\d]

because [^\W] matches the same as \w, and \d is subtracted from that because it's in a negated character class.
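A small Java sketch of this double-negation trick follows (the class and method names are illustrative only). One detail worth noting: \w also includes the underscore, so [^\W\d] matches letters and the underscore, not letters alone.

```java
import java.util.regex.Pattern;

final class DoubleNegationDemo {
    // "Not a non-word character and not a digit": word characters minus digits.
    static final Pattern WORD_MINUS_DIGITS = Pattern.compile("[^\\W\\d]");

    static boolean matches(String s) {
        return WORD_MINUS_DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Letters and '_' are accepted; digits, punctuation and spaces are not.
        for (String s : new String[] {"a", "Z", "_", "5", "-", " "}) {
            System.out.println("\"" + s + "\" -> " + matches(s));
        }
    }
}
```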

Edit:

Some regex engines (like XPath, .NET and JGSoft) allow flexible character class subtraction like this:

[a-z-[e-g]]

to match any character from the range [a-z], excluding e, f and g. But Java does not support this syntax.
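For what it's worth, java.util.regex does provide character-class intersection with &&, and intersecting with a negated class gives the same effect as the subtraction above. The sketch below is an illustration of that alternative, not something the third-party app is confirmed to support; the class and method names are made up.

```java
import java.util.regex.Pattern;

final class IntersectionDemo {
    // Intersection of [a-z] with "anything except e-g":
    // lowercase ASCII letters minus e, f and g.
    static final Pattern A_TO_Z_MINUS_E_TO_G = Pattern.compile("[a-z&&[^e-g]]");

    static boolean matches(String s) {
        return A_TO_Z_MINUS_E_TO_G.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "d", "e", "f", "g", "h", "A"}) {
            System.out.println("\"" + s + "\" -> " + matches(s));
        }
    }
}
```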

Tim Pietzcker answered Nov 14 '22 22:11



Another possibility is to use two ranges and combine them; e.g.

([\w]|[^\s<>.!?]){2}

However, this does bring up the question of what you are actually trying to express here. Because this example (as I've rewritten it) doesn't make a lot of sense.

What it says is "a word character, or any character that is not whitespace or certain punctuation". But the class of characters that are not "whitespace or certain punctuation" ALREADY includes all of the word characters. So, unless you mean something different, the \w is redundant.
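The redundancy can be checked empirically. The following Java sketch (names are illustrative, and it only covers two-character ASCII strings rather than all of Unicode) compares the alternation with the plain negated class on every such input and reports whether they ever disagree.

```java
import java.util.regex.Pattern;

final class RedundantAlternationDemo {
    // The combined form from this answer.
    static final Pattern COMBINED = Pattern.compile("([\\w]|[^\\s<>.!?]){2}");

    // The same pattern with the \w branch dropped.
    static final Pattern NEGATED_ONLY = Pattern.compile("[^\\s<>.!?]{2}");

    // Returns true if both patterns give the same answer for every
    // two-character string built from ASCII code points 0..127.
    static boolean agreeOnAsciiPairs() {
        for (char a = 0; a < 128; a++) {
            for (char b = 0; b < 128; b++) {
                String s = "" + a + b;
                boolean combined = COMBINED.matcher(s).matches();
                boolean negatedOnly = NEGATED_ONLY.matcher(s).matches();
                if (combined != negatedOnly) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("patterns agree on all ASCII pairs: " + agreeOnAsciiPairs());
    }
}
```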

Stephen C answered Nov 14 '22 22:11
