When testing small strings (e.g. isPhoneNumber or isHexadecimal), is there a performance benefit to using regular expressions, or would brute-forcing them be faster? That is, wouldn't simply checking whether each character of the given string falls within a specified range beat a regex?
For example:
public static boolean isHexadecimal(String value)
{
    if (value.startsWith("-"))
    {
        value = value.substring(1);
    }
    value = value.toLowerCase();
    if (value.length() <= 2 || !value.startsWith("0x"))
    {
        return false;
    }
    for (int i = 2; i < value.length(); i++)
    {
        char c = value.charAt(i);
        if (!(c >= '0' && c <= '9' || c >= 'a' && c <= 'f'))
        {
            return false;
        }
    }
    return true;
}
vs.
Regex.match(/0x[0-9a-f]+/, "0x123fa") // returns true if the regex matches the whole given string
It seems like there would be some overhead associated with a regex, even when the pattern is pre-compiled, simply because regular expressions have to handle many general cases. In contrast, the brute-force method does exactly what is required and no more. Am I missing some optimization that regular expressions have?
Text can consist of almost anything: letters, digits, whitespace, special characters. As long as a string follows some sort of pattern, a regex is robust enough to capture that pattern and return a specific part of the string.
A regular expression (regex) is one of the most powerful and flexible text-processing tools available. Regex has its own terminology, conditions, and syntax; it is, in a sense, a mini programming language. Regexes can be used to add, remove, isolate, and manipulate all kinds of text and data.
Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters you scan to determine the match, the faster your regexes will be.
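One concrete example of being more specific: Java's possessive quantifiers (`++`) forbid backtracking, which can never help in a plain validation pattern like this one, so invalid input is rejected more cheaply. A small sketch (class and field names here are illustrative):

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // The greedy version may backtrack when the overall match fails;
    // the possessive version (++) gives up immediately instead.
    static final Pattern GREEDY = Pattern.compile("0x[0-9a-fA-F]+");
    static final Pattern POSSESSIVE = Pattern.compile("0x[0-9a-fA-F]++");

    public static void main(String[] args) {
        System.out.println(GREEDY.matcher("0x123fa").matches());      // true
        System.out.println(POSSESSIVE.matcher("0x123fa").matches());  // true
        System.out.println(POSSESSIVE.matcher("0x123fgh").matches()); // false
    }
}
```

Both patterns accept and reject exactly the same strings; the difference is only in how much work the engine does on the failing cases.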
Checking whether string characters are within a certain range is exactly what regular expressions are built to do. The engine converts the expression into a series of instructions; in effect, it writes out your manual parsing steps, just at a lower level.
What tends to be slow with regular expressions is the conversion of the expression into those instructions, so you see real performance gains when a regex is used more than once: compile the expression ahead of time, then simply apply the resulting compiled pattern in a match, search, replace, etc.
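A minimal sketch of the compile-once approach (the `-?` prefix is an assumption here, added to cover the optional minus sign that the hand-written versions strip manually):

```java
import java.util.regex.Pattern;

public class PrecompiledRegexDemo {
    // Compiled once; Pattern instances are immutable and thread-safe,
    // so a single static instance can be reused for every call.
    private static final Pattern HEX = Pattern.compile("-?0x[0-9a-fA-F]+");

    public static boolean isHexCompiled(String s) {
        return HEX.matcher(s).matches();
    }

    public static void main(String[] args) {
        // String.matches re-compiles the pattern on every call:
        System.out.println("0x123fa".matches("-?0x[0-9a-fA-F]+")); // true
        // The precompiled pattern skips that per-call compilation:
        System.out.println(isHexCompiled("-0x1A"));                // true
        System.out.println(isHexCompiled("0xZZ"));                 // false
    }
}
```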
As is the case with anything to do with performance, perform some tests and measure the results.
I've written a small benchmark to estimate the performance of the approaches discussed above.
The test machine configuration is as follows:
And here are the results I got for the original test string "0x123fa" and 10,000,000 iterations:
Method "NOP" => #10000000 iterations in 9ms
Method "isHexadecimal (OP)" => #10000000 iterations in 300ms
Method "RegExp" => #10000000 iterations in 4270ms
Method "RegExp (Compiled)" => #10000000 iterations in 1025ms
Method "isHexadecimal (maraca)" => #10000000 iterations in 135ms
Method "fastIsHex" => #10000000 iterations in 107ms
As you can see, even the original method by the OP is faster than the RegExp method (at least when using the JDK-provided RegExp implementation).
Benchmark code (for your reference):
public static void main(String[] argv) throws Exception {
    // Number of iterations
    final int ITERATIONS = 10000000;
    // NOP
    benchmark(ITERATIONS, "NOP", () -> nop(longHexText));
    // isHexadecimal
    benchmark(ITERATIONS, "isHexadecimal (OP)", () -> isHexadecimal(longHexText));
    // Un-compiled regexp
    benchmark(ITERATIONS, "RegExp", () -> longHexText.matches("0x[0-9a-fA-F]+"));
    // Pre-compiled regexp
    final Pattern pattern = Pattern.compile("0x[0-9a-fA-F]+");
    benchmark(ITERATIONS, "RegExp (Compiled)", () -> {
        pattern.matcher(longHexText).matches();
    });
    // isHexadecimal (maraca)
    benchmark(ITERATIONS, "isHexadecimal (maraca)", () -> isHexadecimalMaraca(longHexText));
    // fastIsHex
    benchmark(ITERATIONS, "fastIsHex", () -> fastIsHex(longHexText));
}
public static void benchmark(int iterations, String name, Runnable block) {
    // Start time
    long stime = System.currentTimeMillis();
    // Benchmark
    for (int i = 0; i < iterations; i++) {
        block.run();
    }
    // Done
    System.out.println(
        String.format("Method \"%s\" => #%d iterations in %dms", name, iterations, (System.currentTimeMillis() - stime))
    );
}
NOP method:
public static boolean nop(String value) { return true; }
fastIsHex method:
public static boolean fastIsHex(String value) {
    // Value must be at least 4 characters long (0x00)
    if (value.length() < 4) {
        return false;
    }
    // Compute where the data starts
    int start = ((value.charAt(0) == '-') ? 1 : 0) + 2;
    // Check prefix
    if (value.charAt(start - 2) != '0' || value.charAt(start - 1) != 'x') {
        return false;
    }
    // Verify data
    for (int i = start; i < value.length(); i++) {
        switch (value.charAt(i)) {
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
            case 'a': case 'b': case 'c': case 'd': case 'e': case 'f':
            case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
                continue;
            default:
                return false;
        }
    }
    return true;
}
So, the answer is no: for short strings and the task at hand, RegExp is not faster.
When it comes to longer strings, the balance is quite different. Below are the results for an 8192-character hex string, which I generated with:
hexdump -n 8196 -v -e '/1 "%02X"' /dev/urandom
and 10,000 iterations:
Method "NOP" => #10000 iterations in 2ms
Method "isHexadecimal (OP)" => #10000 iterations in 1512ms
Method "RegExp" => #10000 iterations in 1303ms
Method "RegExp (Compiled)" => #10000 iterations in 1263ms
Method "isHexadecimal (maraca)" => #10000 iterations in 553ms
Method "fastIsHex" => #10000 iterations in 530ms
As you can see, the hand-written methods (the one by maraca and my fastIsHex) still beat the RegExp, but the original method does not (due to substring() and toLowerCase()).
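If hexdump isn't available, a long test string of the same shape can be built in Java itself. A minimal sketch (the randomHex helper, its length argument, and the seed are illustrative, not part of the original benchmark):

```java
import java.util.Random;

public class HexStringGen {
    // Builds an "0x"-prefixed string of n uppercase hex digits,
    // similar in shape to the hexdump %02X output used above.
    public static String randomHex(int n, long seed) {
        Random rnd = new Random(seed); // fixed seed for a reproducible test string
        StringBuilder sb = new StringBuilder(n + 2).append("0x");
        for (int i = 0; i < n; i++) {
            sb.append(Character.toUpperCase(Character.forDigit(rnd.nextInt(16), 16)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = randomHex(8192, 42L);
        System.out.println(s.length()); // 8194 ("0x" + 8192 digits)
    }
}
```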
Sidenote:
This benchmark is admittedly very simple and only tests the "worst case" scenario (i.e. a fully valid string); real-life results, with mixed data lengths and a non-zero share of invalid inputs, might be quite different.
Update:
I also gave the char[] array version a try:
char[] chars = value.toCharArray();
for (idx += 2; idx < chars.length; idx++) { ... }
and it was even a bit slower than the charAt(i) version:
Method "isHexadecimal (maraca) char[] array version" => #10000000 iterations in 194ms
Method "fastIsHex, char[] array version" => #10000000 iterations in 164ms
My guess is that this is due to the array copy inside toCharArray().
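That copy is easy to observe: String.toCharArray() allocates a fresh array on every call, which is exactly the per-call overhead that a charAt(i) loop avoids. A small illustration:

```java
public class ToCharArrayCopy {
    public static void main(String[] args) {
        String s = "0x123fa";
        // toCharArray() allocates and fills a new array on every call,
        // so two calls never return the same array instance.
        char[] a = s.toCharArray();
        char[] b = s.toCharArray();
        System.out.println(a == b);      // false: distinct copies
        // Mutating the copy cannot touch the immutable String.
        a[0] = 'X';
        System.out.println(s.charAt(0)); // '0': the String is unchanged
    }
}
```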
Update (#2):
I've run an additional 8k/100,000-iteration test to see whether there is any real difference in speed between the "maraca" and "fastIsHex" methods, and have also normalized them to use exactly the same precondition code:
Run #1
Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5341ms
Method "fastIsHex" => #100000 iterations in 5313ms
Run #2
Method "isHexadecimal (maraca) *normalized" => #100000 iterations in 5313ms
Method "fastIsHex" => #100000 iterations in 5334ms
I.e. the speed difference between these two methods is marginal at best and is probably down to measurement error (as I'm running this on my workstation and not in a specially set-up clean test environment).