I am creating a syntax highlighter, and I am using String.split to create tokens from an input string. The first issue is that String.split creates a huge number of empty strings, which makes everything noticeably slower than it needs to be.
For example, "***".split(/(\*)/) returns ["", "*", "", "*", "", "*", ""].
Is there a way to avoid this?
Another issue is the expression precedence in the regular expression itself.
Let's say I am trying to parse a C-style multi-line comment, that is, /* comment */.
Now let's assume the input string is "/****/".
If I were to use the following regular expression, it would work, but produce a lot of extra tokens (and all those empty strings!).
/(\/\*|\*\/|\*)/
A better way is to read the /*'s and */'s, and then read all the remaining *'s as one token.
That is, the better result for the above string is ["/*", "**", "*/"].
However, when using the regular expression that should do this, I get bad results.
The regular expression is like so: /(\/\*|\*\/|\*+)/.
However, the result of this expression is ["/*", "***", "/"].
I am guessing this is because \*+ is greedy, so it steals the final * that should belong to the \*\/ alternative.
The only solution I found was to use a negative lookahead, like this:
/(\/\*|\*\/|\*+(?!\/))/
This gives the expected result, but it is much slower than the previous expression, and the slowdown is noticeable on big strings.
Is there a solution for either of these problems?
Use a lookahead to avoid empty matches:
arr = "***".split(/(?=\*)/);
//=> ["*", "*", "*"]
Or use filter(Boolean) to discard the empty strings:
arr = "***".split(/(\*)/).filter(Boolean);
//=> ["*", "*", "*"]
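The same trick applies to the comment regex from the question: keep the lookahead version for correctness, and let filter(Boolean) drop the empty strings that split produces around each captured separator (a quick sketch):

```javascript
// split() with a capture group keeps the separators in the result,
// but surrounds each one with empty strings; filter(Boolean) drops them.
const tokens = "/****/".split(/(\/\*|\*\/|\*+(?!\/))/).filter(Boolean);
console.log(tokens); // → ["/*", "**", "*/"]
```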
Generally, for tokenizing you use match, not split:
> str = "/****/"
"/****/"
> str.match(/(\/\*)(.*?)(\*\/)/)
["/****/", "/*", "**", "*/"]
Also note how the non-greedy modifier ? solves the second problem.
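To tokenize a whole source string rather than a single comment, match with the g flag returns every token in order, with no empty strings to filter out. A minimal sketch, assuming the token set from the question; the \/ and [^*\/]+ fallback alternatives are my own additions to cover text outside comments:

```javascript
// match() with the g flag returns all tokens in order; the lookahead
// keeps a run of *'s from swallowing the final "*/".
const tokenize = (src) =>
  src.match(/\/\*|\*\/|\*+(?!\/)|\/|[^*\/]+/g) || [];

console.log(tokenize("/****/"));      // → ["/*", "**", "*/"]
console.log(tokenize("a /* b */ c")); // → ["a ", "/*", " b ", "*/", " c"]
```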