 

Regex optimisation - escaping ampersands in java

I need to replace every & in a String that isn't part of an HTML entity, so that the String "This & entites &gt; & &lt;" becomes "This &amp; entites &gt; &amp; &lt;".

I've come up with this regex pattern: "&[a-zA-Z0-9]{2,7};", which works fine. But I'm not very skilled with regex, and when I test the speed over 100k iterations, it takes twice as long as the method I used previously, which didn't use regex (but didn't work 100% either).

Testcode:

long time = System.currentTimeMillis();
String reg = "&(?!&#?[a-zA-Z0-9]{2,7};)";
String s = "a regex test 1 & 2  1&2 and &_gt; - &_lt;";
String test = null;
for (int i = 0; i < 100000; i++) {
    test = s.replaceAll(reg, "&amp;");
}
System.out.println("Finished in:" + (System.currentTimeMillis() - time) + " milliseconds");

So the question is whether there are any obvious ways to optimize this regex to make it more efficient.

Duveit asked May 11 '09 13:05



2 Answers

s.replaceAll(reg, "&amp;") compiles the regular expression on every call. Compiling the pattern once gives some increase in performance (~30% in this case).

import java.util.regex.Pattern;

long time = System.currentTimeMillis();
String reg = "&(?!&#?[a-zA-Z0-9]{2,7};)";
Pattern p = Pattern.compile(reg);  // compile once, outside the loop
String s = "a regex test 1 & 2  1&2 and &_gt; - &_lt;";
for (int i = 0; i < 100000; i++) {
    String test = p.matcher(s).replaceAll("&amp;");
}
System.out.println("Finished in:" +
             (System.currentTimeMillis() - time) + " milliseconds");
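
In application code, the same idea is often packaged as a helper that keeps the compiled pattern in a static field. The following is only a minimal sketch: the class name AmpersandEscaper and the method name escape are hypothetical, and the pattern used is the question's look-ahead with the extra & removed, which is an assumption about the intended behaviour rather than the asker's exact regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the original post.
final class AmpersandEscaper {
    // Compiled once when the class loads. Pattern objects are immutable,
    // so a single instance can be shared between calls and threads.
    // Assumption: the look-ahead omits the extra '&' from the question's regex.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!#?[a-zA-Z0-9]{2,7};)");

    private AmpersandEscaper() {}

    // Replaces every '&' that does not start something shaped like an entity.
    static String escape(String input) {
        return BARE_AMPERSAND.matcher(input).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("1 & 2 &gt; 0")); // prints "1 &amp; 2 &gt; 0"
    }
}
```

Each call still creates a new Matcher, but the comparatively expensive compilation step happens only once.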
Chris Thornhill answered Oct 07 '22 17:10


You have to remove the & from your look-ahead assertion: the leading & has already been matched, so the look-ahead should describe only what follows it. Try this regular expression:

&(?!#?[a-zA-Z0-9]{2,7};)

Or to be more precise:

&(?!(?:#(?:[xX][0-9a-fA-F]+|[0-9]+)|[a-zA-Z]+);)
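
To see the effect of the extra & in the look-ahead, here is a small comparison sketch. The class name LookaheadComparison, the constant names and the sample strings are made up for illustration; only the two regular expressions come from the question and from the first suggestion above.

```java
import java.util.regex.Pattern;

// Illustrative comparison; class, field and method names are hypothetical.
final class LookaheadComparison {
    // Regex from the question: the look-ahead requires a second '&'
    // directly after the matched one, so "&gt;" is not seen as an entity.
    static final Pattern ORIGINAL =
            Pattern.compile("&(?!&#?[a-zA-Z0-9]{2,7};)");

    // Same regex with the '&' removed from the look-ahead.
    static final Pattern CORRECTED =
            Pattern.compile("&(?!#?[a-zA-Z0-9]{2,7};)");

    static String escape(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String sample = "1 & 2 &gt; 0";
        System.out.println(escape(ORIGINAL, sample));  // prints "1 &amp; 2 &amp;gt; 0"
        System.out.println(escape(CORRECTED, sample)); // prints "1 &amp; 2 &gt; 0"
    }
}
```

With the original pattern the existing entity gets escaped a second time, while the corrected pattern leaves it alone.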
Gumbo answered Oct 07 '22 17:10