
Java regex to exclude specific strings from a larger one

I have been banging my head against this for some time now: I want to capture all [a-z]+[0-9]? character sequences, excluding strings such as sin, cos, tan, etc. Having done my regex homework, I think the following regex should work:

(?:(?!(sin|cos|tan)))\b[a-z]+[0-9]?

As you can see, I am using a negative lookahead with alternation. The \b after the closing parenthesis of the non-capturing group is critical to avoid matching the "in" of "sin", etc. The regex makes sense, and in fact I have tried it in RegexBuddy with Java as the target implementation and got the wanted result, but it doesn't work using Java's Matcher and Pattern objects! Any thoughts?

cheers

nvrs asked Feb 03 '10 10:02



2 Answers

The \b is in the wrong place. It would be looking for a word boundary that didn't have sin/cos/tan before it. But a boundary just after any of those would have a letter at the end, so it would have to be an end-of-word boundary, which it can't be if the next character is a-z.

Also, the negative lookahead would (if it worked) exclude strings like cost, which I'm not sure you want if you're just filtering out keywords.

I suggest:

\b(?!sin\b|cos\b|tan\b)[a-z]+[0-9]?\b
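
For illustration, here is a minimal Java sketch of how that pattern might be used with Pattern and Matcher. The class name IdentifierScan, the method findIdentifiers, and the sample input are made up for this example and are not from the question. One thing worth checking when a regex works in a tester but not in Java code: inside a Java string literal, "\b" is the backspace escape, so the regex \b has to be written as "\\b".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class IdentifierScan {
    // The suggested regex. Every backslash is doubled because "\b" in a
    // Java string literal means backspace, not a regex word boundary.
    static final Pattern IDENT =
            Pattern.compile("\\b(?!sin\\b|cos\\b|tan\\b)[a-z]+[0-9]?\\b");

    // Scans the input left to right and collects every match.
    static List<String> findIdentifiers(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = IDENT.matcher(input);
        while (m.find()) {   // find() scans; matches() would require the whole input to match
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample input chosen for this sketch.
        System.out.println(findIdentifiers("sin x1 + cost * tan + foo2"));
    }
}
```

With this input, the sketch should print [x1, cost, foo2]: the bare keywords sin and tan are skipped, while cost is kept because the lookahead only rejects a keyword that is followed by a word boundary.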

Or, more simply, you could just match \b[a-z]+[0-9]?\b and filter out the strings in the keyword list afterwards. You don't always have to do everything in regex.
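
A rough sketch of that match-then-filter approach, assuming Java 9 or later (for Set.of). The class name KeywordFilter, the method findNonKeywords, and the sample input are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the "filter afterwards" idea.
class KeywordFilter {
    // Plain token pattern with no lookahead; backslashes doubled for Java.
    static final Pattern TOKEN = Pattern.compile("\\b[a-z]+[0-9]?\\b");

    // Keyword list taken from the question; extend as needed.
    static final Set<String> KEYWORDS = Set.of("sin", "cos", "tan");

    // Collects every token, then drops the ones that are exact keywords.
    static List<String> findNonKeywords(String input) {
        List<String> kept = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String token = m.group();
            if (!KEYWORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(findNonKeywords("sin x1 + cost * tan + foo2"));
    }
}
```

The keyword list lives in ordinary code here, so adding or removing a keyword does not require touching the regex.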

bobince answered Sep 30 '22 16:09



So you want [a-z]+[0-9]? (a sequence of at least one letter, optionally followed by a digit), unless that letter sequence is exactly one of sin, cos, or tan?

\b(?!(sin|cos|tan)(?=\d|\b))[a-z]+\d?\b

results:

cos   - no match
cosy  - full match
cos1  - no match
cosy1 - full match
bla9  - full match
bla99 - no match
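
The table above can be checked with a small Java sketch along these lines. The class name KeywordLookaheadCheck and the helper classify are hypothetical names introduced for this example; "full match" here is taken to mean that find() returns a match equal to the whole input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness for the pattern above.
class KeywordLookaheadCheck {
    // Same regex as above, with backslashes doubled for a Java string literal.
    static final Pattern IDENT =
            Pattern.compile("\\b(?!(sin|cos|tan)(?=\\d|\\b))[a-z]+\\d?\\b");

    // Reports whether the pattern matches nothing, the whole input, or only part of it.
    static String classify(String input) {
        Matcher m = IDENT.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return m.group().equals(input) ? "full match" : "partial match";
    }

    public static void main(String[] args) {
        String[] samples = {"cos", "cosy", "cos1", "cosy1", "bla9", "bla99"};
        for (String s : samples) {
            // Left-align the sample in a 5-character column, as in the table.
            System.out.printf("%-5s - %s%n", s, classify(s));
        }
    }
}
```

Run as written, this should reproduce the six result lines listed above.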
Tomalak answered Sep 30 '22 17:09
