Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it more efficient to use a positive look-behind or a back reference?

Tags:

java

regex

Using the code of a question I just answered as an example

Starting with the string

30-Nov-2012 30-Nov-2012 United Kingdom, 31-Oct-2012 31-Oct-2012 United
Arab Emirates, 29-Oct-2012 31-Oct-2012 India 

What if we wanted to replace the space after every four-digit number with an @ such that you end up with something like this:

30-Nov-2012@30-Nov-2012@United Kingdom, 31-Oct-2012@31-Oct-2012@United
Arab Emirates, 29-Oct-2012@31-Oct-2012@India  

How much more efficient is it to use a back-reference rather than a positive lookbehind (if at all)?

back-reference:

inputString.replaceAll("(\\d{4})\\s", "$1@");

positive lookbehind:

inputString.replaceAll("(?<=\\d{4})\\s", "@");
like image 616
Michael Avatar asked Oct 31 '12 07:10

Michael


1 Answers

I admit my testing methodology is crude and may be flawed (furthermore I don't know Java, only learning enough to write this answer), but my initial evidence proves contrary to @dasblinkenlight's answer. I ran the following code:

import java.util.*;
import java.lang.*;

class Main
{
    private static void test (String regex, String replace, int repetitions)
    {
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < repetitions; i++)
        {
            String str = "30-Nov-2012 United Kingdom, 31-Oct-2012 31-Oct-2012 United Arab Emirates, 29-Oct-2012 31-Oct-2012 India, ";
            str.replaceAll(regex, replace);
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Execution time: " + Long.toString(endTime - startTime));
    }

    public static void main (String[] args) throws java.lang.Exception
    {
        test("(\\d{4})\\s", "$1@", 10000);
        test("(?<=\\d{4})\\s", "@", 10000);
        test("(\\d{4})\\s", "$1@", 10000);
        test("(?<=\\d{4})\\s", "@", 10000);
        test("(\\d{4})\\s", "$1@", 10000);
        test("(?<=\\d{4})\\s", "@", 10000);
        test("(\\d{4})\\s", "$1@", 10000);
        test("(?<=\\d{4})\\s", "@", 10000);
    }
}

...here, http://ideone.com/WkHLMN, and the output was:

Execution time: 164
Execution time: 140
Execution time: 96
Execution time: 135
Execution time: 95
Execution time: 133
Execution time: 94
Execution time: 130

Ignoring the first set of cases as outliers having to do with initialization, the remainder cases seem to indicate that the latter expression, using the positive lookbehind assertion, might do as much as 50% more work! I suspect this may be the case because backreferences would require going back to check whether an assertion is true, after passing the characters of interest.

like image 174
slackwing Avatar answered Oct 01 '22 08:10

slackwing