When testing small strings (e.g. isPhoneNumber or isHexadecimal), is there a performance benefit to using regular expressions, or would brute-forcing them be faster? That is, wouldn't simply checking whether each character of the given string falls within a specified range beat a regex?
For example:
public static boolean isHexadecimal(String value)
{
    if (value.startsWith("-"))
    {
        value = value.substring(1);
    }
    value = value.toLowerCase();
    if (value.length() <= 2 || !value.startsWith("0x"))
    {
        return false;
    }
    for (int i = 2; i < value.length(); i++)
    {
        char c = value.charAt(i);
        if (!(c >= '0' && c <= '9' || c >= 'a' && c <= 'f'))
        {
            return false;
        }
    }
    return true;
}
vs.
Regex.match(/0x[0-9a-f]+/, "0x123fa") // returns true if the regex matches the whole given string
It seems like there would be some overhead associated with a regex, even when the pattern is pre-compiled, simply because regular expressions have to handle many general cases. In contrast, the brute-force method does exactly what is required and no more. Am I missing some optimization that regular expressions have?
Text can consist of almost anything: letters, digits, whitespace, special characters. As long as a string follows some sort of pattern, a regex is robust enough to capture that pattern and return a specific part of the string.
A regular expression (regex) is one of the most powerful and flexible text-processing tools available. Regex has its own terminology, conditions, and syntax; it is, in a sense, a mini programming language. Regexes can be used to add, remove, isolate, and manipulate all kinds of text and data.
Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters you scan to determine the match, the faster your regexes will be.
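One concrete example of being more specific: Java's possessive quantifiers (`++`) forbid backtracking, which can never help in a plain validation pattern like this one, so invalid input is rejected more cheaply. A small sketch (class and field names here are illustrative):

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // The greedy version may backtrack when the overall match fails;
    // the possessive version (++) gives up immediately instead.
    static final Pattern GREEDY = Pattern.compile("0x[0-9a-fA-F]+");
    static final Pattern POSSESSIVE = Pattern.compile("0x[0-9a-fA-F]++");

    public static void main(String[] args) {
        System.out.println(GREEDY.matcher("0x123fa").matches());      // true
        System.out.println(POSSESSIVE.matcher("0x123fa").matches());  // true
        System.out.println(POSSESSIVE.matcher("0x123fgh").matches()); // false
    }
}
```

Both patterns accept and reject exactly the same strings; the difference is only in how much work the engine does on the failing cases.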
Checking whether string characters are within a certain range is exactly what regular expressions are built to do. The engine converts the expression into a series of instructions; in effect, it writes out your manual parsing steps, just at a lower level.
What tends to be slow with regular expressions is the conversion of the expression into those instructions, so you see real performance gains when a regex is used more than once: compile the expression ahead of time, then simply apply the resulting compiled pattern in a match, search, replace, etc.
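A minimal sketch of the compile-once approach (the `-?` prefix is an assumption here, added to cover the optional minus sign that the hand-written versions strip manually):

```java
import java.util.regex.Pattern;

public class PrecompiledRegexDemo {
    // Compiled once; Pattern instances are immutable and thread-safe,
    // so a single static instance can be reused for every call.
    private static final Pattern HEX = Pattern.compile("-?0x[0-9a-fA-F]+");

    public static boolean isHexCompiled(String s) {
        return HEX.matcher(s).matches();
    }

    public static void main(String[] args) {
        // String.matches re-compiles the pattern on every call:
        System.out.println("0x123fa".matches("-?0x[0-9a-fA-F]+")); // true
        // The precompiled pattern skips that per-call compilation:
        System.out.println(isHexCompiled("-0x1A"));                // true
        System.out.println(isHexCompiled("0xZZ"));                 // false
    }
}
```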
As is the case with anything to do with performance, perform some tests and measure the results.
I've written a small benchmark to estimate the performance of the approaches discussed above.
The test machine configuration is as follows:
And here are the results I got for the original test string "0x123fa" and 10,000,000 iterations:
Method "NOP" => #10000000 iterations in 9ms
Method "isHexadecimal (OP)" => #10000000 iterations in 300ms
Method "RegExp" => #10000000 iterations in 4270ms
Method "RegExp (Compiled)" => #10000000 iterations in 1025ms
Method "isHexadecimal (maraca)" => #10000000 iterations in 135ms
Method "fastIsHex" => #10000000 iterations in 107ms
As you can see, even the original method by the OP is faster than the RegExp method (at least when using the JDK-provided RegExp implementation).
Benchmark code (for your reference):
public static void main(String[] argv) throws Exception {
    // Number of iterations
    final int ITERATIONS = 10000000;
    // NOP
    benchmark(ITERATIONS, "NOP", () -> nop(longHexText));
    // isHexadecimal
    benchmark(ITERATIONS, "isHexadecimal (OP)", () -> isHexadecimal(longHexText));
    // Un-compiled regexp
    benchmark(ITERATIONS, "RegExp", () -> longHexText.matches("0x[0-9a-fA-F]+"));
    // Pre-compiled regexp
    final Pattern pattern = Pattern.compile("0x[0-9a-fA-F]+");
    benchmark(ITERATIONS, "RegExp (Compiled)", () -> {
        pattern.matcher(longHexText).matches();
    });
    // isHexadecimal (maraca)
    benchmark(ITERATIONS, "isHexadecimal (maraca)", () -> isHexadecimalMaraca(longHexText));
    // fastIsHex
    benchmark(ITERATIONS, "fastIsHex", () -> fastIsHex(longHexText));
}
public static void benchmark(int iterations, String name, Runnable block) {
    // Start time
    long stime = System.currentTimeMillis();
    // Benchmark
    for (int i = 0; i < iterations; i++) {
        block.run();
    }
    // Done
    System.out.println(
        String.format("Method \"%s\" => #%d iterations in %dms", name, iterations, (System.currentTimeMillis() - stime))
    );
}
NOP method:
public static boolean nop(String value) { return true; }
fastIsHex method:
public static boolean fastIsHex(String value) {
    // Value must be at least 4 characters long (0x00)
    if (value.length() < 4) {
        return false;
    }
    // Compute where the data starts
    int start = ((value.charAt(0) == '-') ? 1 : 0) + 2;
    // Check prefix
    if (value.charAt(start - 2) != '0' || value.charAt(start - 1) != 'x') {
        return false;
    }
    // Verify data
    for (int i = start; i < value.length(); i++) {
        switch (value.charAt(i)) {
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
            case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
            case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
                continue;
            default:
                return false;
        }
    }
    return true;
}
So, the answer is no: for short strings and the task at hand, RegExp is not faster.
When it comes to longer strings, the balance is quite different. Below are the results for an 8192-character hex string, which I generated with:
hexdump -n 8196 -v -e '/1 "%02X"' /dev/urandom
and 10,000 iterations:
Method "NOP" => #10000 iterations in 2ms
Method "isHexadecimal (OP)" => #10000 iterations in 1512ms
Method "RegExp" => #10000 iterations in 1303ms
Method "RegExp (Compiled)" => #10000 iterations in 1263ms
Method "isHexadecimal (maraca)" => #10000 iterations in 553ms
Method "fastIsHex" => #10000 iterations in 530ms
As you can see, the hand-written methods (the one by maraca and my fastIsHex) still beat the RegExp, but the original method does not (due to substring() and toLowerCase()).
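If hexdump isn't available, a long test string of the same shape can be built in Java itself. A minimal sketch (the randomHex helper, its length argument, and the seed are illustrative, not part of the original benchmark):

```java
import java.util.Random;

public class HexStringGen {
    // Builds an "0x"-prefixed string of n uppercase hex digits,
    // similar in shape to the hexdump %02X output used above.
    public static String randomHex(int n, long seed) {
        Random rnd = new Random(seed); // fixed seed for a reproducible test string
        StringBuilder sb = new StringBuilder(n + 2).append("0x");
        for (int i = 0; i < n; i++) {
            sb.append(Character.toUpperCase(Character.forDigit(rnd.nextInt(16), 16)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = randomHex(8192, 42L);
        System.out.println(s.length()); // 8194 ("0x" + 8192 digits)
    }
}
```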
Sidenote:
This benchmark is admittedly very simple and only tests the "worst case" scenario (i.e. a fully valid string); real-life results, with mixed data lengths and a non-zero share of invalid inputs, might be quite different.
Update:
I also gave the char[] array version a try:
char[] chars = value.toCharArray();
for (idx += 2; idx < chars.length; idx++) { ... }
and it was even a bit slower than the charAt(i) version:
Method "isHexadecimal (maraca) char[] array version" => #10000000 iterations in 194ms
Method "fastIsHex, char[] array version" => #10000000 iterations in 164ms
My guess is that this is due to the array copy inside toCharArray().
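That copy is easy to observe: String.toCharArray() allocates a fresh array on every call, which is exactly the per-call overhead that a charAt(i) loop avoids. A small illustration:

```java
public class ToCharArrayCopy {
    public static void main(String[] args) {
        String s = "0x123fa";
        // toCharArray() allocates and fills a new array on every call,
        // so two calls never return the same array instance.
        char[] a = s.toCharArray();
        char[] b = s.toCharArray();
        System.out.println(a == b);      // false: distinct copies
        // Mutating the copy cannot touch the immutable String.
        a[0] = 'X';
        System.out.println(s.charAt(0)); // '0': the String is unchanged
    }
}
```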
Update (#2):
I've run an additional 8k/100,000-iteration test to see whether there is any real difference in speed between the "maraca" and "fastIsHex" methods, and have also normalized them to use exactly the same precondition code:
Run #1
Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5341ms
Method "fastIsHex" => #100000 iterations in 5313ms
Run #2
Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5313ms
Method "fastIsHex" => #100000 iterations in 5334ms
I.e. the speed difference between these two methods is marginal at best and is probably down to measurement error (as I'm running this on my workstation and not in a specially set-up clean test environment).