I was reading this article today on two different regular expression algorithms.
According to the article old Unix tools like ed, sed, grep, egrep, awk, and lex, all use what's called the Thompson NFA algorithm in their regular expresssions...
However newer tools like Java, Perl, PHP, and Python all use a different algorithm for their regular expressions that are much, much slower.
This article makes no mention at all of Javascript's regex algorthim, (and yes I know there are various JS engines out there) but I was wondering if anybody knew which of those algorithms they use, and if maybe those algorithms should be swapped out for Thompson NFA.
Most library implementations of regular expressions use a backtracking algorithm that can take an exponential amount of time on some inputs.
Regular expressions are patterns that provide a powerful way to search and replace in text. In JavaScript, they are available via the RegExp object, as well as being integrated in methods of strings.
JavaScript regular expression object provides two methods: test() and exec() to evaluate a regular expression.
They form part of the basic techniques in NLP and learning them will make you a more efficient programmer. Therefore, Regular Expression is one of the key concepts of Natural Language Processing that every NLP expert should be proficient in.
The Javascript ECMA language description doesn't impose a requirement for the particular implementation of regular expressions, so that part of the question isn't well-formed. You're really wondering about the particular implementation in a particular browser.
The reason Perl/Python etc use a slower algorithm, though, is that the regex language defined isn't really regular expressions. A real regular expression can be expressed as a finite state machine, but the language of regex is context free. That's why the fashion is to just call it "regex" instead of talking about regular expressions.
Yes, in fact javascript regex isn't content free regular. Consider the syntax using `{n,m}', that is, matches from n to m accepted regexs. Let d the difference d=|n-m|. The syntax means there exists a string uxdw that is acceptable, but a string uxk>dw that is not. It follows via the pumping lemma for regular languages that this is not a regular language.
(augh. Thinko corrected.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With