I have a problem with the following regular expression:
var s = "http://www.google.com/dir/file\r\nhello";
var re = new RegExp("http://([^/]+).*/([^/\r\n]+)$");
var arr = re.exec(s);
alert(arr[2]);
Above, I expect arr[2] (i.e. capture group 2) to be "file": the greedy .* should backtrack to the last / on the first line, the group should then match the last 4 characters of that line, and $ should anchor at the end of the line.
In fact, arr is null, which means the pattern did not match at all.
I can alter this slightly so it does precisely what I intend:
var s = "http://www.google.com/dir/file\r\nhello";
var re = new RegExp("http://([^/]+).*/([^/\r\n]+)[\r\n]*");
var arr = re.exec(s);
alert(arr[2]); // "file", as expected
My question is not so much HOW to grab "file" from the end of the first line in s. Instead, I'm trying to understand WHY the first regexp fails and the second succeeds. Why does $ not match at the \r\n line break in example 1? Isn't that the sole purpose of its existence? Is there something else I'm missing?
Also, consider the same first regular expression as used in sed (with extended regular expression mode enabled with -r):
$ echo -e "http://www.google.com/dir/file\r\nhello" |sed -r -e 's#http://([^/]+).*/([^/\r\n]+)$#\2.OUTSIDE.OF.CAPTURE.GROUP#'
<<OUTPUT>>
file.OUTSIDE.OF.CAPTURE.GROUP
hello
Here, capture group 2 captures "file" and nothing else. "hello" appears in the output but is not inside the capture group, as the position of the string ".OUTSIDE.OF.CAPTURE.GROUP" in the output shows. So the regular expression behaves as I expect in sed, but not in the built-in JavaScript regexp engine.
If I replace \r\n in the input string with just \n, the behavior is identical for all three examples above, so that should not be relevant as far as I can tell.
Line Anchors
In regex, anchors are not used to match characters. Rather, they match a position, i.e. before, after, or between characters. To match the start and end of a line, we use the following anchors: Caret (^) matches the position before the first character in the string. Dollar ($) matches the position right after the last character in the string.
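A small sketch of what "matching a position" means in JavaScript. The sample string and variable names are made up for illustration; without any flags, ^ and $ refer to the start and end of the whole string.

```javascript
// Anchors match positions, so replacing them inserts text without consuming any characters.
const text = "abc";
const atStart = text.replace(/^/, ">"); // insert at the position before "a"
const atEnd = text.replace(/$/, "<");   // insert at the position after "c"

// The match itself is empty; only its index is meaningful.
const endMatch = /$/.exec(text);

console.log(atStart);            // ">abc"
console.log(atEnd);              // "abc<"
console.log(endMatch[0].length); // 0
console.log(endMatch.index);     // 3
```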
You can use the same method to expand the match of any regular expression to an entire line, or a block of complete lines. In some cases, such as when using alternation, you will need to group the original regex together using parentheses.
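As a rough illustration, assuming "the same method" means wrapping the pattern in ^.* and .*$, here is how an alternation could be expanded to whole lines in JavaScript. The input text and names are invented for the example; the g and m flags are needed so that ^ and $ apply to every line.

```javascript
// Hypothetical input with three lines.
const log = "alpha one\nbeta two\ngamma three";

// The alternation is grouped with (?:...) so that ^.* and .*$ apply to both branches.
const wholeLine = /^.*(?:beta|gamma).*$/gm;

// With the g flag, match() returns every full match, here the complete lines.
const matchedLines = log.match(wholeLine);

console.log(matchedLines); // [ 'beta two', 'gamma three' ]
```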
Finally, .*$ causes the regex to actually match the line, after the lookaheads have determined it meets the requirements. If your condition is that a line should not contain something, use negative lookahead. ^((?!regexp).)*$ matches a complete line that does not match regexp.
For the positive lookahead, we only need to find one location where it can match. But the negative lookahead must be tested at each and every character position in the line. We must test that regexp fails everywhere, not just somewhere.
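A minimal sketch of the negative-lookahead form ^((?!regexp).)*$ in JavaScript, with the literal word "error" standing in for regexp. The input text and names are assumptions for illustration only.

```javascript
// Hypothetical input: the middle line contains "error".
const input = "ok: started\nerror: disk full\nok: finished";

// (?!error). consumes one character only if "error" does not start at that position,
// so the whole line matches only when "error" appears nowhere in it.
const noError = /^((?!error).)*$/gm;

const cleanLines = input.match(noError);

console.log(cleanLines); // [ 'ok: started', 'ok: finished' ]
```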
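By default, $ in JavaScript matches only at the very end of the input string, so in example 1 it can never match in front of \r\n. sed reads its input one line at a time, which is why $ already means end of line there. To make $ also match immediately before each line break in JavaScript, enable multiline mode with the m flag: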
var re = new RegExp("http://([^/]+).*/([^/\r\n]+)$", "m");
http://javascript.info/tutorial/ahchors-and-multiline-mode
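To see the difference side by side, here is a short sketch that runs the pattern from the question with and without the m flag. The variable names are mine; the string and the pattern are taken from the question.

```javascript
const s = "http://www.google.com/dir/file\r\nhello";
const pattern = "http://([^/]+).*/([^/\r\n]+)$";

// Without m, $ only matches at the end of the whole string, which is after "hello".
const withoutM = new RegExp(pattern).exec(s);

// With m, $ also matches right before the \r that ends the first line.
const withM = new RegExp(pattern, "m").exec(s);

console.log(withoutM); // null
console.log(withM[1]); // "www.google.com"
console.log(withM[2]); // "file"
```

Note that . never matches \r or \n in JavaScript unless the s flag is set, so .* cannot carry the match across the line break to reach the end of the string.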