
replace characters which do not match with the ones in a regex

Tags:

java

regex

I have this regex:

private static final String SPACE_PATH_REGEX ="[a-z|A-Z|0-9|\\/|\\-|\\_|\\+]+";

I check whether my string matches this regex and, if it does not, I want to replace every character that is not in the allowed set with "_".

I've tried like:

private static final String SPACE_PATH_REGEX_EXCLUDE =
        "[~a-z|A-Z|0-9|\\/|\\-|\\_|\\+]+";
if (myCompanyName.matches(SPACE_PATH_REGEX)) {
    myNewCompanySpaceName = myCompanyName;
} else{
    myNewCompanySpaceName = myCompanyName.replaceAll(
            SPACE_PATH_REGEX_EXCLUDE, "_");
}

but it does not work. In the second regex, "~" does not seem to negate the characters that follow it.

Any idea?

Cristian Boariu asked Apr 09 '10 11:04


2 Answers

You have several problems in your regex (see the Pattern class for the rules):

  • Inside a character class, | has no special meaning and should simply be removed in your case (unless you want the class to include the literal | character).
  • Similarly, you don't need to escape /, _ and + inside a character class.
  • - only needs to be escaped if it's not the first or last character of the class.
  • ~ also has no special meaning in a character class; it just represents itself.
  • To negate the content of a character class, put ^ as its first character.

You can also skip the first matches() check, as the replaceAll() call returns an unmodified String if nothing matches anyway. Keeping it (and the second regular expression) only serves to introduce another place where bugs could hide (for example, you could accidentally update one regex but not the other).
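
Putting those points together, a minimal sketch might look like the following. The class name SpacePathSanitizer and the method sanitize are hypothetical, chosen only for illustration; the allowed characters are taken from the regex in the question.

```java
import java.util.regex.Pattern;

class SpacePathSanitizer {
    // Negated class: any single character that is NOT a letter, a digit, '/', '_', '+' or '-'.
    // The hyphen is placed last, so it needs no escaping.
    private static final Pattern DISALLOWED = Pattern.compile("[^a-zA-Z0-9/_+-]");

    // Hypothetical helper: replaces each disallowed character with "_".
    // No matches() check is needed; a clean input comes back unchanged.
    static String sanitize(String name) {
        return DISALLOWED.matcher(name).replaceAll("_");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("My Company & Co."));     // My_Company___Co_
        System.out.println(sanitize("already/ok_name-1+2"));  // already/ok_name-1+2
    }
}
```

Each disallowed character is replaced individually here. Appending + to the character class would instead collapse a run of disallowed characters into a single "_"; which behaviour is wanted depends on the use case.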

Joachim Sauer answered Sep 23 '22 14:09


Try:

// Matches any single character that is NOT a word character, '/', '-' or '+'.
final String SPACE_PATH_REGEX_EXCLUDE = "[^\\w/\\-+]";
String out = in.replaceAll(SPACE_PATH_REGEX_EXCLUDE, "_");

The primary issue is the unnecessary |s in your pattern. Inside a character class, | is just a literal pipe character; it only means alternation outside one. You can also greatly simplify the expression by using \w ("word character"), which matches letters (uppercase or lowercase), digits or underscore, and by default is equivalent to [A-Za-z0-9_].

You also need to understand how escaping works. There is Java string escaping, which is why you write \\ to put a single backslash into the pattern, and there is regex escaping on top of that. For example, \n in a Java string literal is a newline character, while \\n puts the two-character sequence \n into the pattern, which the regex engine then interprets as a newline.
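
The two layers can be seen side by side in a small sketch. The class EscapingLayers and the helper containsMatch are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class EscapingLayers {
    // Hypothetical helper: true if the regex matches anywhere in the text.
    static boolean containsMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "a\nb"; // Java escape: a real newline between 'a' and 'b'

        // "\\n" is the two characters '\' and 'n'; the regex engine reads them as "newline".
        System.out.println(containsMatch("\\n", text)); // true

        // "\n" is already a newline character in the string; the regex matches it literally.
        System.out.println(containsMatch("\n", text));  // true

        // "\\\\" is two backslash characters, i.e. the regex \\ for one literal backslash.
        System.out.println("C:\\temp".replaceAll("\\\\", "/")); // C:/temp
    }
}
```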

You have two convenient options for escaping a bunch of text:

  1. You can use \Q...\E. Anything between \Q and \E is escaped; and

  2. You can use Pattern.quote() to quote an arbitrary string.
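
Both options can be sketched as follows, using the literal text a+b|c as an example of input that contains regex metacharacters. The class QuotingDemo and its method names are hypothetical.

```java
import java.util.regex.Pattern;

class QuotingDemo {
    // Option 1: wrap the literal text in \Q...\E inside the pattern itself.
    static boolean matchesWithQE(String input) {
        return input.matches("\\Qa+b|c\\E");
    }

    // Option 2: let Pattern.quote() build the quoted pattern from an arbitrary string.
    static boolean matchesWithQuote(String input) {
        return input.matches(Pattern.quote("a+b|c"));
    }

    public static void main(String[] args) {
        System.out.println(matchesWithQE("a+b|c"));    // true: matched as literal text
        System.out.println(matchesWithQuote("a+b|c")); // true
        System.out.println(matchesWithQuote("aab"));   // false: '+' and '|' are not operators here
        System.out.println("aab".matches("a+b|c"));    // true: unquoted, '+' and '|' are operators
    }
}
```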

Whether you need to escape a character in a regex depends on context. For example, - only needs to be escaped where it could be mistaken for a range operator. [a-z] is any lowercase letter. [a\-z] is one of a, - or z. But you can write -[a-z] to match a hyphen followed by a lowercase letter. Note that the first hyphen there needs no escaping, because it is outside the character class.
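
The hyphen cases above can be checked with a short sketch. HyphenDemo and its helper m are hypothetical names; String.matches() tests the whole input against the pattern.

```java
class HyphenDemo {
    // Hypothetical helper: true if the whole string matches the regex.
    static boolean m(String regex, String s) {
        return s.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(m("[a-z]", "b"));     // true:  range a..z
        System.out.println(m("[a-z]", "-"));     // false: the hyphen only defines the range
        System.out.println(m("[a\\-z]", "-"));   // true:  escaped hyphen is a literal
        System.out.println(m("[a\\-z]", "b"));   // false: only 'a', '-' or 'z' are allowed
        System.out.println(m("[az-]", "-"));     // true:  a trailing hyphen is a literal
        System.out.println(m("-[a-z]", "-x"));   // true:  hyphen outside the class is a literal
    }
}
```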

cletus answered Sep 22 '22 14:09