How to overcome the lack of Perl's \G in JavaScript code?

In Perl, continuous parsing of a string can be done like this:

my $string = " a 1 # ";

# /c keeps pos() after a failed match, so the next \G test
# starts where the previous token ended.
while () {
    if ( $string =~ /\G\s+/gc ) {
        print "whitespace\n";
    }
    elsif ( $string =~ /\G[0-9]+/gc ) {
        print "integer\n";
    }
    elsif ( $string =~ /\G\w+/gc ) {
        print "word\n";
    }
    else {
        print "done\n";
        last;
    }
}

Source: When is \G useful application in a regex?

It produces the following output:

whitespace
word
whitespace
integer
whitespace
done

In JavaScript (and many other regular expression flavors) there is no \G pattern, nor any good replacement.

So I came up with a very simple solution that serves my purpose.

//*************************************************
// pattmatch - matches the pattern PAT against ST starting at POS
// note the use of "^" to simulate the \G assertion
//*************************************************
function pattmatch(st, pat, pos)
{
    var resu;
    pat.lastIndex = 0;
    if (pos === 0)
        return pat.exec(st);             // look for any identifier
    else {
        resu = pat.exec(st.slice(pos));  // look for any identifier
        if (resu)
            pat.lastIndex = pat.lastIndex + pos;  // convert back to an index into st
        return resu;
    }
}

So, the above example would look like this in JavaScript (node.js):

var string = " a 1 # ";
var pos = 0, ret;
// The (?: ) group makes "^" apply to every alternative, not only to \s+.
// The m flag is omitted so that "^" matches only at the start of the slice.
var getLexema = new RegExp("^(?:(\\s+)|([0-9]+)|(\\w+))", "gi");
while (pos < string.length && (ret = pattmatch(string, getLexema, pos))) {
    if (ret[1]) console.log("whitespace");
    if (ret[2]) console.log("integer");
    if (ret[3]) console.log("word");
    pos = getLexema.lastIndex;
}
console.log("done");

It produces the same output as the Perl snippet:

whitespace
word
whitespace
integer
whitespace
done

Notice that the parser stops at the # character. Parsing can be resumed from position pos in another code snippet.

Is there a better way in JavaScript to simulate Perl's \G regex assertion?
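To illustrate resuming from pos, here is a minimal sketch. The names matchFrom and scan and the sample input are hypothetical; matchFrom is a simplified stand-in for pattmatch that slices the string and matches with a ^-anchored pattern.

```javascript
// Hypothetical helper in the spirit of pattmatch: match `re` (anchored with ^,
// no g flag) against the part of `str` that begins at `pos`.
function matchFrom(str, re, pos) {
  return re.exec(str.slice(pos));
}

const lexeme = /^(?:(\s+)|([0-9]+)|(\w+))/;

// Collect token names starting at `pos`; return the position where scanning stopped.
function scan(str, pos, out) {
  let m;
  while (pos < str.length && (m = matchFrom(str, lexeme, pos))) {
    out.push(m[1] ? "whitespace" : m[2] ? "integer" : "word");
    pos += m[0].length;
  }
  return pos;
}

const input = " a 1 # b";
const first = [];
let pos = scan(input, 0, first);     // stops at the "#" (index 5)
const second = [];
pos = scan(input, pos + 1, second);  // skip the "#" and resume
console.log(first, second, pos);
```

The first pass collects the five tokens before the #; the second pass picks up the whitespace and the word that follow it.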

Edit

Out of curiosity, I decided to compare my own solution with @georg's proposal. I am not claiming that either is better; for me, it is a matter of taste.

Will my system, which depends heavily on user interaction, become slow?

@ikegami writes about @georg's solution:

... his solution adds is a reduction in the number of times your input file is copied ...

So I decided to compare both solutions in a loop that runs each one 10 million times:

var i;
var n1, n2;
var string, pos, m, conta, re;

// My code
conta = 0;
n1 = Date.now();
for (i = 0; i < 10000000; i++) {
  string = " a 1 # ";
  pos = 0;
  re = new RegExp("^(\\s+)|([0-9]+)|(\\w+)", "gim");
  while (pos < string.length && (m = pattMatch(string, re, pos))) {
    if (m[1]) conta++;
    if (m[2]) conta++;
    if (m[3]) conta++;
    pos = re.lastIndex;
  }
}
n2 = Date.now();
console.log('Mine: ', ((n2 - n1) / 1000).toFixed(2), ' seconds');

// Other code
conta = 0;
n1 = Date.now();
for (i = 0; i < 10000000; i++) {
  string = " a 1 # ";
  re = /^(?:(\s+)|([0-9]+)|(\w+))/i;
  while (m = string.match(re)) {
    if (m[1]) conta++;
    if (m[2]) conta++;
    if (m[3]) conta++;
    string = string.slice(m[0].length);
  }
}
n2 = Date.now();
console.log('Other: ', ((n2 - n1) / 1000).toFixed(2), ' seconds');

//*************************************************
// pattMatch - matches the pattern PAT against ST starting at POS
// note the use of "^" to simulate the \G assertion
//*************************************************
function pattMatch(st, pat, pos)
{
  var resu;
  pat.lastIndex = 0;
  if (pos === 0)
    return pat.exec(st);
  else {
    resu = pat.exec(st.slice(pos));
    if (resu)
      pat.lastIndex = pat.lastIndex + pos;
    return resu;
  }
} // pattMatch

Results:

Mine: 11.90 seconds
Other: 10.77 seconds

My code runs about 10% longer. It spends about 110 nanoseconds more per iteration.
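The two figures follow directly from the timings above; as a quick check (variable names are only illustrative):

```javascript
// Timings reported above, in seconds, for 10 million iterations each.
const mine = 11.90;
const other = 10.77;
const iterations = 1e7;

const extraSeconds = mine - other;                        // about 1.13 s in total
const perIterationNs = (extraSeconds / iterations) * 1e9; // extra cost per iteration
const relativePercent = (extraSeconds / other) * 100;     // slowdown relative to the other code

console.log(perIterationNs.toFixed(0) + " ns", relativePercent.toFixed(1) + " %");
```

This gives roughly 113 ns per iteration and a slowdown of about 10.5%.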

Honestly, I find this loss of efficiency acceptable in a system with heavy user interaction.

If my project involved heavy mathematical processing with multidimensional arrays or gigantic neural networks, I might reconsider.

asked Aug 01 '17 13:08

Paulo Buchsbaum


2 Answers

The functionality of \G exists in the form of the y (sticky) flag.

var regex = /^foo/y;
regex.lastIndex = 2;
regex.test('..foo');   // false - index 2 is not the beginning of the string

var regex2 = /^foo/my;
regex2.lastIndex = 2;
regex2.test('..foo');  // false - index 2 is not the beginning of the string or line
regex2.lastIndex = 2;
regex2.test('.\nfoo'); // true - index 2 is the beginning of a line

Note that the flag was only added in ES2015 (ES6), so older browsers such as Internet Explorer do not support it. Check the browser compatibility chart in the linked documentation before relying on it on public web sites.
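Assuming an engine that supports the y flag, the Perl loop from the question could be sketched as follows. The function name tokenize and the rule table are hypothetical, chosen only to mirror the Perl branches; setting lastIndex before each attempt plays the role of Perl's pos().

```javascript
// Sketch: a tokenizer driven by sticky (y) regexes, one per token type.
function tokenize(input) {
  const rules = [
    ["whitespace", /\s+/y],
    ["integer", /[0-9]+/y],
    ["word", /\w+/y],
  ];
  const tokens = [];
  let pos = 0;
  scan: while (pos < input.length) {
    for (const [name, re] of rules) {
      re.lastIndex = pos;        // try to match exactly at pos
      const m = re.exec(input);  // with y, no searching ahead happens
      if (m) {
        tokens.push(name);
        pos = re.lastIndex;      // advance past the matched token
        continue scan;
      }
    }
    break;                       // no rule matched at pos: stop
  }
  return { tokens, pos };
}

const result = tokenize(" a 1 # ");
console.log(result.tokens.join("\n"));
console.log("done at index", result.pos);
```

For the sample string this prints whitespace, word, whitespace, integer, whitespace and stops at index 5, the position of the # character. The input string is never copied, and the regex still sees the text to the left of pos.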

answered Nov 01 '22 13:11

ikegami


Looks like you're overcomplicating it a bit. exec with the g flag resumes at lastIndex out of the box, and a catch-all alternative such as ([\s\S]) keeps consecutive matches contiguous:

var 
    string = " a 1 # ",
    re  = /(\s+)|([0-9]+)|(\w+)|([\s\S])/gi,
    m;

while (m = re.exec(string)) {
    if (m[1]) console.log('space');
    if (m[2]) console.log('int');
    if (m[3]) console.log('word');
    if (m[4]) console.log('unknown');    
}

If your regexp does not cover every possible input and you want to stop at the first non-match, the simplest way is to anchor the pattern with ^ and strip the matched prefix from the string after each match:

    var 
        string = " a 1 # ",
        re  = /^(?:(\s+)|([0-9]+)|(\w+))/i,
        m;

    while (m = string.match(re)) {
        if (m[1]) console.log('space');
        if (m[2]) console.log('int');
        if (m[3]) console.log('word');
        string = string.slice(m[0].length)
    }

    console.log('done, rest=[%s]', string)

This simple method doesn't fully replace \G (or your "match from" method), because it loses the left context of the match.
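A small sketch of what losing the left context means, using the word-boundary assertion \b. The sample string is made up, and the in-place variant assumes an engine that supports the y flag:

```javascript
const text = "ab";

// Slicing: "b" becomes the start of a new string, so \b sees a word boundary.
const sliced = /^\bb/.test(text.slice(1));

// Matching in place: the regex still sees "a" to the left of index 1,
// so there is no word boundary there and the match fails.
const sticky = /\bb/y;
sticky.lastIndex = 1;
const inPlace = sticky.test(text);

console.log(sliced, inPlace); // true false
```

The same difference shows up with any assertion that looks to the left of the current position, such as lookbehind or ^ in multiline mode.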

answered Nov 01 '22 13:11

georg