JsLex is a Javascript lexer I've written in Python. It does a good job for a day's work (or so), but I'm sure there are cases it gets wrong. In particular, it doesn't understand anything about semicolon insertion, and there are probably ways that's important for lexing. I just don't know what they are.
What Javascript code does JsLex lex incorrectly? I'm especially interested in valid Javascript source where JsLex incorrectly identifies regex literals.
Just to be clear, by "lexing" I mean identifying tokens in a source file. JsLex makes no attempt to parse Javascript, much less execute it. I've written JsLex to do full lexing, though to be honest I would be happy if it merely was able to successfully find all the regex literals.
Interestingly enough I tried your lexer on the code of my lexer/evaluator written in JS ;) You're right, it is not always doing well with regular expressions. Here some examples:
rexl.re = {
NAME: /^(?!\d)(?:\w)+|^"(?:[^"]|"")+"/,
UNQUOTED_LITERAL: /^@(?:(?!\d)(?:\w|\:)+|^"(?:[^"]|"")+")\[[^\]]+\]/,
QUOTED_LITERAL: /^'(?:[^']|'')*'/,
NUMERIC_LITERAL: /^[0-9]+(?:\.[0-9]*(?:[eE][-+][0-9]+)?)?/,
SYMBOL: /^(?:==|=|<>|<=|<|>=|>|!~~|!~|~~|~|!==|!=|!~=|!~|!|&|\||\.|\:|,|\(|\)|\[|\]|\{|\}|\?|\:|;|@|\^|\/\+|\/|\*|\+|-)/
};
This one is mostly fine - only UNQUITED_LITERAL
is not recognized, otherwise all is fine. But now let's make a minor addition to it:
rexl.re = {
NAME: /^(?!\d)(?:\w)+|^"(?:[^"]|"")+"/,
UNQUOTED_LITERAL: /^@(?:(?!\d)(?:\w|\:)+|^"(?:[^"]|"")+")\[[^\]]+\]/,
QUOTED_LITERAL: /^'(?:[^']|'')*'/,
NUMERIC_LITERAL: /^[0-9]+(?:\.[0-9]*(?:[eE][-+][0-9]+)?)?/,
SYMBOL: /^(?:==|=|<>|<=|<|>=|>|!~~|!~|~~|~|!==|!=|!~=|!~|!|&|\||\.|\:|,|\(|\)|\[|\]|\{|\}|\?|\:|;|@|\^|\/\+|\/|\*|\+|-)/
};
str = '"';
Now all after the NAME's
regexp messes up. It makes 1 big string. I think the latter problem is that String token is too greedy. The former one might be too smart regexp for the regex
token.
Edit: I think I've fixed the regexp for the regex
token. In your code replace lines 146-153 (the whole 'following characters' part) with the following expression:
([^/]|(?<!\\)(?<=\\)/)*
The idea is to allow everything except /
, also allow \/
, but not allow \\/
.
Edit: Another interesting case, passes after the fix, but might be interesting to add as the built-in test case:
case 'UNQUOTED_LITERAL':
case 'QUOTED_LITERAL': {
this._js = "e.str(\"" + this.value.replace(/\\/g, "\\\\").replace(/"/g, "\\\"") + "\")";
break;
}
Edit: Yet another case. It appears to be too greedy about keywords as well. See the case:
var clazz = function() {
if (clazz.__) return delete(clazz.__);
this.constructor = clazz;
if(constructor)
constructor.apply(this, arguments);
};
It lexes it as: (keyword, const), (id, ructor)
. The same happens for an identifier inherits
: in
and herits
.
Example: The first occurrence of / 2 /i
below (the assignment to a
) should tokenize as Div, NumericLiteral, Div, Identifier, because it is in a InputElementDiv context. The second occurrence (the assignment to b
) should tokenize as RegularExpressionLiteral, because it is in a InputElementRegExp context.
i = 1;
var a = 1 / 2 /i;
console.info(a); // ⇒ 0.5
console.info(typeof a); // number
var b = 1 + / 2 /i;
console.info(b); // ⇒ 1/2/i
console.info(typeof b); // ⇒ string
Source:
There are two goal symbols for the lexical grammar. The InputElementDiv symbol is used in those syntactic grammar contexts where a division (
/
) or division-assignment (/=
) operator is permitted. The InputElementRegExp symbol is used in other syntactic grammar contexts.Note that contexts exist in the syntactic grammar where both a division and a RegularExpressionLiteral are permitted by the syntactic grammar; however, since the lexical grammar uses the InputElementDiv goal symbol in such cases, the opening slash is not recognised as starting a regular expression literal in such a context. As a workaround, one may enclose the regular expression literal in parentheses. — Standard ECMA-262 3rd Edition - December 1999, p. 11
The simplicity of your solution for handling this hairy problem is very cool, but I noticed that it doesn't quite handle a change in something.property
syntax for ES5, which allows reserved words following a .
. I.e., a.if = 'foo'; (function () {a.if /= 3;});
, is a valid statement in some recent implementations.
Unless I'm mistaken there is only one use of .
anyway for properties, so the fix could be adding an additional state following the .
which only accepts the identifierName token (which is what identifier uses, but it doesn't reject reserved words) would probably do the trick. (Obviously the div state follows that as per usual.)
I've been thinking about the problems of writing a lexer for JavaScript myself, and I just came across your implementation in my search for good techniques. I found a case where yours doesn't work that I thought I'd share if you're still interested:
var g = 3, x = { valueOf: function() { return 6;} } /2/g;
The slashes should both be parsed as division operators, resulting in x being assigned the numeric value 1. Your lexer thinks that it is a regexp. There is no way to handle all variants of this case correctly without maintaining a stack of grouping contexts to distinguish among the end of a block (expect regexp), the end of a function statement (expect regexp), the end of a function expression (expect division), and the end of an object literal (expect division).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With