Remove all non-word chars except in an &amp; or &apos; pattern

Tags: java, regex

I am trying to strip all non-word characters from a string, except when they are part of an entity such as &amp;, i.e. a pattern like &[\w]+;

For example:

abc; => abc
abc &amp; => abc &amp;
abc& => abc

If I use string.replaceAll("\\W", ""), it also removes the ; and the & from the second example, which I don't want.

Could a negative look-ahead give a quick regex solution to this problem?
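To illustrate, here is a minimal sketch of that behavior, assuming the second example was meant to contain the entity &amp;. The class and method names are placeholders chosen only for this example.

```java
class NonWordStripDemo {
    // The naive approach from the question: drop every non-word character.
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNonWord("abc;"));      // abc
        System.out.println(stripNonWord("abc &amp;")); // abcamp (entity is destroyed)
        System.out.println(stripNonWord("abc&"));      // abc
    }
}
```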

Watt asked Feb 14 '13 18:02


1 Answer

First of all, I really like the question. As far as I can see, what you want cannot be done with a single lookaround-based replaceAll, because that would need a negative look-behind of variable (unbounded) length, which is not allowed. If it were allowed, this would not be that difficult.

Anyway, since that is not an option here, you can use a small hack: first replace the trailing semicolon of each entity reference with some character sequence that you are sure does not occur anywhere else in the string, such as XXX. I know this is not strictly correct, but I don't see a way around it.

So, here's what you can try:

String str = "a;b&amp;c &";

str  = str.replaceAll("(&\\w+);", "$1XXX")
          .replaceAll("&(?!\\w+?XXX)|[^\\w&]", "")
          .replaceAll("(&\\w+)XXX", "$1;");

System.out.println(str);  // prints: ab&amp;c

Explanation:

  • The first replaceAll replaces a pattern like &amp; with &ampXXX (or with whatever other sequence you chose to stand in for the trailing ;).
  • The second replaceAll removes any & that is not followed by \w+XXX, as well as any character that is neither a word character nor &. This removes every & that is not part of an &amp;-style pattern, plus every other non-word character.
  • The third replaceAll turns XXX back into ;, restoring &amp; from &ampXXX.
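The three steps can also be run one at a time to see the intermediate strings. This is only a sketch: the class and method names are made up for illustration, and it assumes the sample input contains the entity &amp; and that the placeholder XXX never occurs in real input.

```java
class PlaceholderTrace {
    // Step 1: swap the trailing ';' of each entity for the placeholder.
    static String protect(String s) {
        return s.replaceAll("(&\\w+);", "$1XXX");
    }

    // Step 2: drop stray '&' characters and every other non-word character.
    static String strip(String s) {
        return s.replaceAll("&(?!\\w+?XXX)|[^\\w&]", "");
    }

    // Step 3: turn the placeholder back into ';'.
    static String restore(String s) {
        return s.replaceAll("(&\\w+)XXX", "$1;");
    }

    static String clean(String s) {
        return restore(strip(protect(s)));
    }

    public static void main(String[] args) {
        String step1 = protect("a;b&amp;c &");
        String step2 = strip(step1);
        String step3 = restore(step2);
        System.out.println(step1); // a;b&ampXXXc &
        System.out.println(step2); // ab&ampXXXc
        System.out.println(step3); // ab&amp;c
    }
}
```

The weakness of the hack shows up as soon as the placeholder appears in the input: with this sketch, "&fooXXX bar" comes out as "&foo;bar", an entity that was never in the original string.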

To make this easier to follow, you can use the Pattern and Matcher classes instead; I always prefer them whenever the replacement criteria are complex.

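import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "a;b&amp;c &";

// Match either a whole entity reference or a single non-word character.
Pattern pattern = Pattern.compile("&\\w+;|[^\\w]");
Matcher matcher = pattern.matcher(str);

// appendReplacement(StringBuilder, ...) needs Java 9+; use StringBuffer on older versions.
StringBuilder sb = new StringBuilder();

while (matcher.find()) {
    String match = matcher.group();
    if (!match.matches("&\\w+;")) {
        // A lone non-word character: drop it.
        matcher.appendReplacement(sb, "");
    } else {
        // An entity reference: keep it unchanged.
        matcher.appendReplacement(sb, Matcher.quoteReplacement(match));
    }
}
matcher.appendTail(sb);
System.out.println(sb.toString());  // prints: ab&amp;c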
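On Java 9 or later, the same keep-or-drop decision can be written more compactly with Matcher.replaceAll(Function). The sketch below is a variation on the loop above, not the exact same code: it puts the entity alternative in a capturing group so the callback can tell the two cases apart. The class name, method name and sample inputs are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityCleaner {
    // Group 1 captures a whole entity reference; the other branch matches
    // a single non-word character.
    private static final Pattern TOKEN = Pattern.compile("(&\\w+;)|[^\\w]");

    static String clean(String input) {
        // Keep the match when group 1 participated, otherwise replace it with "".
        return TOKEN.matcher(input).replaceAll(m ->
                m.group(1) != null ? Matcher.quoteReplacement(m.group(1)) : "");
    }

    public static void main(String[] args) {
        System.out.println(clean("a;b&amp;c &")); // ab&amp;c
        System.out.println(clean("abc;"));        // abc
        System.out.println(clean("abc&"));        // abc
    }
}
```

Compiling the pattern once in a static field avoids recompiling it on every call, which matters if the method runs over many strings.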

This one is similar to @Eric's code, but generalizes it. That one only works for &amp;, and only once it has been fixed to avoid the NullPointerException it currently throws.

Rohit Jain answered Sep 20 '22 15:09