Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can a Javascript RegExp match at an exact string offset without searching?

I'm parsing a moderately complex grammar using Javascript and I'd like to use regular expressions to match tokens such as numbers.

Given a string containing the grammar, a regular expression representing a number (say) and an offset within the string, I'd like to find out if the regular expression matches the string at that offset exactly.

I can set lastIndex, call RegExp.exec and check the index property of the resulting match to see if the match happened at the expected offset, but this is very inefficient because exec will search the entire string if it doesn't find a match at the starting offset.

The Javascript spec says "A Pattern evaluates ("compiles") to an internal procedure value. RegExp.prototype.exec can then apply this procedure to a String and an offset within the String to determine whether the pattern would match starting at exactly that offset within the String."

This is exactly what I want, but there seems to be no way of accessing this internal function. Does anyone know if there is?

P.S. I'm currently avoiding this problem by splitting the input string into an array of tokens, but I'd prefer not to.

like image 821
arx Avatar asked Dec 21 '22 07:12

arx


1 Answers

I've thoroughly tested possibly efficient methods, see JSPerf: ~20000 chars, ~1000000 chars. I've created a function to generate a random string consisting of alphanumeric characters. After running this function once, a RegExp pattern is created, to match a string with length 10 at a given offset.

Tested cases (when the condition in if(..) is true, the pattern is found at offset index):

var string = "...about 20000 characters: A-Z a-z 0-9...";
var regexp = /abcdef1324/g;
var regexpSubstr = /^abcdefg1234/;
var index = 10000;

/*1*/ if ( regexpSubstr.exec(string.substr(index,10)) ) ;
/*2*/ regexp.lastIndex = index;
      var found =  regexp.exec(string);
      if (found && found.length + index == regexp.lastIndex ) ;

/*3*/ if ( regexpSubstr.test(string.substr(index,10)) ) ;
/*4*/ // Only when the RegExp matches a fixed number of characters
      regexp.lastIndex = index;
      if ( regexp.test(string) && regexp.lastIndex == index + 10 ) ;

Case 1 and Case 3 are equivalent, because they're checking whether a substring matches the /^abcdef1234/ pattern (does the selected substring start with "abc..etc"?).

Case 2 and Case 4 use the .lastIndex method:
    1.   Set the .lastIndex property of the RegExp to the desired offset
    2.   Check whether the pattern is found.
    3.   Check whether a found pattern is positioned at offset index.
These methods require a Regular expression to have enabled the global flag.

At very large strings, method 4 (lastIndex + test) is proved to be most efficient whn a match at the offset occurs. Method 4, however, requires the matched pattern to have a pre-defined, fixed size.

Method 3 (substr + test) is slightly slower than 4, when a match occurs at the given position. When no match is found at a large string, however, method 3 is significantly faster than method 4. Method 1 and method 3 seems to be equally fast when no match is found.

RegExp methods
The .exec seems to not be more efficient than .test. The match method is not suitable for this case, since it attempts to find all matches, regardless of the .lastIndex property. Another possible RegExp function is the .search function, which is significantly slower for large strings, compared to previously shown methods.

like image 137
Rob W Avatar answered Dec 24 '22 00:12

Rob W