Moving index in JavaScript regex matching

I have this regex to extract double words from text

/[A-Za-z]+\s[A-Za-z]+/g

And this sample text

Mary had a little lamb

My output is this

[0] - Mary had; [1] - a little;

Whereas my expected output is this:

[0] - Mary had; [1] - had a; [2] - a little; [3] - little lamb

How can I achieve this output? As I understand it, the index of the search moves to the end of the first match. How can I move it back one word?

Conversation Company asked Dec 29 '12 12:12



4 Answers

Abusing String.replace function

Here is a little trick with the replace function. Because replace loops through every match and accepts a callback function, it can do far more than replace text. The result ends up in output.

var output = [];
var str = "Mary had a little lamb";
str.replace(/[A-Za-z]+(?=(\s[A-Za-z]+))/g, function ($0, $1) {
    output.push($0 + $1);
    return $0; // Actually we don't care. You don't even need to return
});

Since the output contains overlapping portions of the input string, the regex must not consume the next word while it matches the current one. A look-ahead achieves this (see footnote 1).

The regex /[A-Za-z]+(?=(\s[A-Za-z]+))/g does exactly that: it consumes only one word at a time with the [A-Za-z]+ portion at the start of the regex, looks ahead for the next word with (?=(\s[A-Za-z]+)) (see footnote 2), and captures the text matched inside the look-ahead.

The function passed to replace receives the matched string as its first argument and the captured text in the arguments that follow. (There are more arguments, which are listed in the documentation, but they are not needed here.) Since the look-ahead is zero-width (the input is not consumed), the whole match is conveniently just the first word. The text captured inside the look-ahead goes into the second argument.
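
To make the callback arguments concrete, here is a small sketch that records what replace passes in for each match. The names seen, match, nextWord and offset are chosen for illustration only; the third argument is the position of the match in the string.

```javascript
var str = "Mary had a little lamb";
var seen = [];

str.replace(/[A-Za-z]+(?=(\s[A-Za-z]+))/g, function (match, nextWord, offset) {
    // match    - the consumed word (the whole match)
    // nextWord - the text captured inside the look-ahead, leading whitespace included
    // offset   - index of the match within str
    seen.push({ match: match, nextWord: nextWord, offset: offset });
    return match; // put the word back unchanged
});

// seen[0] is { match: "Mary", nextWord: " had", offset: 0 }
```

Note that the captured text starts with the whitespace, which is why $0 + $1 already reads as "Mary had".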

Proper solution with RegExp.exec

Note that String.replace incurs the overhead of building a replacement string that is never used. If this is unacceptable, you can rewrite the code above with RegExp.exec in a loop:

var output = [];
var str = "Mary had a little lamb";
var re = /[A-Za-z]+(?=(\s[A-Za-z]+))/g;
var arr;

while ((arr = re.exec(str)) != null) {
    output.push(arr[0] + arr[1]);
}
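
The question asked how to move the search index back one word. With exec that is also possible, because a global regex keeps its position in the writable lastIndex property. The following is only a sketch of that alternative, using the question's original regex; the function name wordPairsByRewinding is made up for this example.

```javascript
// Hypothetical helper: keeps the original regex and rewinds lastIndex by hand
function wordPairsByRewinding(str) {
    var re = /[A-Za-z]+\s[A-Za-z]+/g;
    var output = [];
    var m;

    while ((m = re.exec(str)) !== null) {
        output.push(m[0]);
        // Resume right after the FIRST word of this match, so the second
        // word can start the next match. The first word is at least one
        // character long, so lastIndex always moves forward.
        re.lastIndex = m.index + m[0].search(/\s/);
    }
    return output;
}

wordPairsByRewinding("Mary had a little lamb");
// → ["Mary had", "had a", "a little", "little lamb"]
```

The look-ahead version above avoids this bookkeeping, so treat the rewind as an illustration of how lastIndex works rather than a replacement for it.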

Footnote

  1. In other regex flavors that support variable-width look-behind, it is possible to retrieve the previous word instead. JavaScript regex did not support look-behind when this answer was written; it was added to the language in ES2018.

  2. (?=pattern) is syntax for look-ahead.

Appendix

String.match can't be used here, since it ignores capturing groups when the g flag is used. The capturing group is necessary because the next word is matched inside a look-around, which avoids consuming input and so allows overlapping text to be matched.
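
A quick sketch of the difference, for illustration. It also shows String.prototype.matchAll, which was added in ES2020 (well after this answer) and does keep the captures; the variable names firstWords and pairs are just for this example.

```javascript
var str = "Mary had a little lamb";
var re = /[A-Za-z]+(?=(\s[A-Za-z]+))/g;

// With the g flag, match() returns only the full matches; the captures are lost
var firstWords = str.match(re);
// → ["Mary", "had", "a", "little"]

// matchAll() yields one match object per hit, captures included
var pairs = Array.from(str.matchAll(re), function (m) {
    return m[0] + m[1];
});
// → ["Mary had", "had a", "a little", "little lamb"]
```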

nhahtdh answered Nov 02 '22 01:11


It can be done without regexp

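"Mary had a little lamb".split(" ")
    .map(function (item, idx, arr) {
        // Pair each word with the next one; the last word yields undefined
        if (idx < arr.length - 1) {
            return item + " " + arr[idx + 1];
        }
    })
    .filter(function (item) { return item; }) // drop the undefined entry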

Yury Tarabanko answered Nov 01 '22 23:11


Here's a non-regex solution (it's not really a regular problem).

function pairs(str) {
  var parts = str.split(" "), out = [];
  for (var i=0; i < parts.length - 1; i++) 
    out.push([parts[i], parts[i+1]].join(' '));
  return out;
}

Pass your string and you get an array back.


Side note: if you're worried about non-words in your input (making a case for regular expressions!) you can run tests on parts[i] and parts[i+1] inside the for loop. If the tests fail: don't push them onto out.
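
A rough sketch of what that side note could look like. The test used here (letters only, the same character class as in the question) and the name wordPairs are assumptions; swap in whatever definition of "word" you need.

```javascript
function wordPairs(str) {
    var isWord = /^[A-Za-z]+$/; // assumed definition of a "word"
    var parts = str.split(/\s+/), out = [];
    for (var i = 0; i < parts.length - 1; i++) {
        // Only keep the pair when both neighbours pass the test
        if (isWord.test(parts[i]) && isWord.test(parts[i + 1])) {
            out.push(parts[i] + ' ' + parts[i + 1]);
        }
    }
    return out;
}

wordPairs("Mary had 2 little lamb");
// → ["Mary had", "little lamb"]
```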

Brigand answered Nov 01 '22 23:11


Here's another approach you might like:

var s = "Mary had a little lamb";

// Break on each word and loop
s.match(/\w+/g).map(function(w) {

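    // Get the word, a space and another word. The \b stops the word from
    // matching at the tail of a longer word (e.g. the "a" in "data").
    return s.match(new RegExp('\\b' + w + '\\s\\w+'));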

// At this point, there is one "null" value (the last word), so filter it out
}).filter(Boolean)

// There, we have an array of matches -- we want the matched value, i.e. the first element
.map(Array.prototype.shift.call.bind(Array.prototype.shift));

If you run this in your console, you'll see ["Mary had", "had a", "a little", "little lamb"].

This way you keep the shape of your original regex and can extend it however you want, although it needs some code around it to actually work.

By the way, this code is not cross-browser. The following functions are not supported in IE8 and below:

  • Array.prototype.filter
  • Array.prototype.map
  • Function.prototype.bind

But they're easily shimmable, or the same functionality is easily achievable with a for loop.
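
For reference, a for-loop version might look roughly like this. It is a sketch, not a drop-in replacement: the name pairsWithLoop is invented, a \b is added so a short word cannot match inside a longer one, and, like the code above, it always finds the first occurrence of a word, so repeated words give repeated pairs.

```javascript
function pairsWithLoop(s) {
    var words = s.match(/\w+/g) || []; // match() returns null when nothing matches
    var out = [];
    for (var i = 0; i < words.length; i++) {
        // The word, a whitespace character and the following word
        var m = s.match(new RegExp('\\b' + words[i] + '\\s\\w+'));
        if (m) {
            out.push(m[0]); // the last word has no follower, so m is null there
        }
    }
    return out;
}

pairsWithLoop("Mary had a little lamb");
// → ["Mary had", "had a", "a little", "little lamb"]
```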

Florian Margaine answered Nov 02 '22 00:11