Find out the position where a regular expression failed

I'm trying to write a lexer in JavaScript to find the tokens of a simple domain-specific language. I started with a simple implementation that tries a series of regexps from the current position in a line, to see whether the input matches some token format, and accepts the token if it does.

The problem is that when something doesn't match inside such a regexp, the whole regexp fails, so I don't know exactly which character caused it to fail.

Is there any way to find out the position in the string which caused the regular expression to fail?

INB4: I'm not asking about debugging my regexp and verifying its correctness. It is correct already: it matches correct strings and rejects incorrect ones. I just want to know programmatically where exactly the regexp stopped matching, to find out the position of the incorrect character in the user input, and how many of the characters before it were OK.

Is there some way to do it with just simple regexps instead of going on with implementing a full-blown finite state automaton?

575

asked May 23 '14 22:05

SasQ

2 Answers

Short answer

There is no such thing as a "position in the string that causes the regular expression to fail".

However, I will show you an approach to answer the reverse question:

At which token in the regex did the engine become unable to match the string?

Discussion

In my view, the question of the position in the string which caused the regular expression to fail is upside-down. As the engine moves down the string with the left hand and the pattern with the right hand, a regex token that matches six characters one moment can later, because of quantifiers and backtracking, be reduced to matching zero characters the next—or expanded to match ten.

In my view, a more proper question would be:

At which token in the regex did the engine become unable to match the string?

For instance, consider the regex ^\w+\d+$ and the string abc132z.

The \w+ can actually match the entire string. Yet, the entire regex fails. Does it make sense to say that the regex fails at the end of the string? I don't think so. Consider this.
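A quick check of that claim, as a minimal sketch (the variable names are only illustrative):

```javascript
var str = "abc132z";

// On its own, \w+ consumes the whole string, since letters and digits are both \w.
var prefix = /^\w+/.exec(str);
console.log(prefix[0]); // "abc132z"

// The full pattern still fails: it requires the string to end in a digit,
// but this one ends in "z".
var full = /^\w+\d+$/.test(str);
console.log(full); // false
```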

Initially, \w+ will match abc132z. Then the engine advances to the next token: \d+. At this stage, the engine backtracks in the string, gradually letting the \w+ give up the 2z (so that the \w+ now only corresponds to abc13), allowing the \d+ to match 2.

At this stage, the $ assertion fails as the z is left. The engine backtracks, letting the \w+ give up the 3 character, then the 1 (so that the \w+ now only corresponds to abc), eventually allowing the \d+ to match 132. At each step, the engine tries the $ assertion and fails. Depending on engine internals, more backtracking may occur: the \d+ will give up the 2 and the 3 once again, then the \w+ will give up the c and the b. When the engine finally gives up, the \w+ only matches the initial a. Can you say that the regex "fails on the 3"? On the "b"?

No. If you're looking at the regex pattern from left to right, you can argue that it fails on the $, because it's the first token we were not able to add to the match. Bear in mind that there are other ways to argue this.

Further down, I'll give you a screenshot to visualize this. But first, let's see if we can answer the other question.

The Other Question

Are there techniques that allow us to answer the other question:

At which token in the regex did the engine become unable to match the string?

It depends on your regex. If you are able to slice your regex into clean components, then you can devise an expression with a series of optional lookaheads inside capture groups, allowing the match to always succeed. The first unset capture group is the one that caused the failure.

JavaScript is a bit stingy on optional lookaheads, but you can write something like this:

^(?:(?=(\w+)))?(?:(?=(\w+\d+)))?(?:(?=(\w+\d+$)))?.

In PCRE, .NET, Python... you could write this more compactly:

^(?=(\w+))?(?=(\w+\d+))?(?=(\w+\d+$))?.

What happens here? Each lookahead builds incrementally on the last one, adding one token at a time. Therefore we can test each token separately. The dot at the end is an optional flourish for visual feedback: we can see in a debugger that at least one character is matched, but we don't care about that character, we only care about the capture groups.

Group 1 tests the \w+ token
Group 2 seems to test \w+\d+, therefore, incrementally, it tests the \d+ token
Group 3 seems to test \w+\d+$, therefore, incrementally, it tests the $ token

There are three capture groups. If all three are set, the match is a full success. If only Group 3 is not set (as with abc123a), you can say that the $ caused the failure. If Group 1 is set but not Group 2 (as with abc), you can say that the \d+ caused the failure.
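The same incremental idea can also be sketched in JavaScript without putting everything into one regex. Rather than relying on how a given engine treats optional zero-width groups, this variation tests each cumulative prefix of the pattern separately. The helper name firstFailingComponent and the parts array are hypothetical, and the sketch assumes the pattern can be sliced into components that concatenate cleanly.

```javascript
// Hypothetical helper (not from the answer above): returns the index of the
// first component that cannot be added to the match, or -1 if all succeed.
function firstFailingComponent(components, input) {
  var prefix = "";
  for (var i = 0; i < components.length; i++) {
    prefix += components[i];
    // Anchor at the start so every test begins at position 0.
    if (!new RegExp("^(?:" + prefix + ")").test(input)) {
      return i;
    }
  }
  return -1;
}

// The example pattern ^\w+\d+$ sliced into its three tokens.
var parts = ["\\w+", "\\d+", "$"];
console.log(firstFailingComponent(parts, "abc123"));  // -1 (full match)
console.log(firstFailingComponent(parts, "abc123a")); // 2  ($ fails)
console.log(firstFailingComponent(parts, "abc"));     // 1  (\d+ fails)
console.log(firstFailingComponent(parts, "!!!"));     // 0  (\w+ fails)
```

This compiles one regex per component, which is wasteful for a hot lexer loop; caching the compiled prefixes would be the obvious refinement.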

For reference: Inside View of a Failure Path

For what it's worth, here is a view of the failure path from the RegexBuddy debugger.

[Image: RegexBuddy Debug]

143

answered Sep 21 '22 12:09

zx81

You can use a negated character set in a RegExp:

[^xyz]
[^a-c]
A negated or complemented character set. That is, it matches anything that is not enclosed in the brackets. You can specify a range of characters by using a hyphen, but if the hyphen appears as the first or last character enclosed in the square brackets it is taken as a literal hyphen to be included in the character set as a normal character.

index property of String.prototype.match()

The returned Array has an extra input property, which contains the original string that was parsed. In addition, it has an index property, which represents the zero-based index of the match in the string.

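For example, to log the index at which the digit is matched by the RegExp /[^a-zA-Z]/ in the string aBcD7zYx:

var re = /[^a-zA-Z]/;            // first character that is not an ASCII letter
var str = "aBcD7zYx";
var m = str.match(re);           // null when every character is a letter
console.log(m ? m.index : -1);   // 4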
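To also report how many characters were OK, as the question asks, the same idea could be wrapped in a small helper. This is only a sketch: firstInvalidChar is a hypothetical name, and the approach assumes a token's validity can be decided one character at a time, which is not true of every token format.

```javascript
// Hypothetical helper: `disallowed` is a regex matching any single character
// that the token format does not permit.
function firstInvalidChar(input, disallowed) {
  var index = input.search(disallowed); // -1 when no disallowed character exists
  return {
    index: index,
    validCount: index === -1 ? input.length : index
  };
}

console.log(firstInvalidChar("aBcD7zYx", /[^a-zA-Z]/)); // { index: 4, validCount: 4 }
console.log(firstInvalidChar("abc", /[^a-zA-Z]/));      // { index: -1, validCount: 3 }
```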

answered Sep 20 '22 12:09

guest271314

Tags: javascript, regex, lexical-analysis
