Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are the downsides of using a simple regex-based markdown parser?

I require a relatively simple markdown parser for my application. Just simple stuff like bolding, italics, etc. I was looking around for libraries and many seem to be quite large. For example, marked is quite popular with 20,000 stars. And it's close to 2,000 lines of code. I'm not even sure how large this one is, but it seems quite complex.

Generally I try to prefer to keep things simple and limit my dependencies whenever possible. I'm not exactly sure what all those lines are doing? I was pleased to soon after find this library which isn't even 100 lines, and it just uses a simple regex to transform text into its corresponding markdown.

My question is, basically, what are those other libraries doing? Am I missing something by opting to use a simpler, regex-focused approach? Is the latter library not safe in some way? Should I be considering some other factor that I am ignorant of?

Clearly there seems to be something of importance that I am missing, because the former libraries seem quite popular, and the latter one has not even a single star. I'm just not sure what that is. I'm hoping that the case is that the latter is fine for simple cases, and the former ones are more "complete" if that's what you need, but I don't want to jump to that conclusion.

like image 468
Ryan Peschel Avatar asked Sep 04 '19 04:09

Ryan Peschel


1 Answers

There are a number of factors which contribute to the complexity of Markdown parsers. That said, you can use a "simple regex-based" method to build a Markdown parser. In fact, this is exactly what the reference implementation uses (in Perl). It runs a series of regular expressions which replace the Markdown syntax with HTML syntax on the existing document. Even then, the source code is comprised of 1451 lines of code, including comments, license, etc. Of course, it includes support for the entire list of features described in the original syntax rules. Those features include things like support for nesting, escaping and the like, which significantly complicate the use of regex.

Some people find such an implementation limiting. It all depends on what you want out of a Markdown parser.

For example, extending the syntax is near impossible with the reference implementation. As an example, Python-Markdown (of which I am a developer) has taken the reference implementation, given each regex a name and provided a way for third-party extensions to replace or insert new regular expressions into the mix. The boilerplate code just to allow this adds considerably more lines of code. By the way, Markdown is old and libs such as Python-Markdown have changed and grown over the years. The first version very closely mimicked the reference implementation, but today you would be hard pressed to see any similarities between them.

Others aren't interested in extending the syntax so much as offering a way to control the output. For example, the marked JS library outputs an abstract syntax tree (AST), which can then be passed to a renderer. Renderers accept the AST (basically a list of tokens) and output some other format. That other format could be HTML, or it could be something else. Pandoc takes advantage of this to convert to and from many document formats. Naturally, this adds additional lines of code.

An additional factor, regardless if implementation, is that many would argue that if an implementation doesn't support all of the features in the rules, then it is not Markdown. In fact, over the years, many implementation have added non-standard features (see GitHub Flavored Markdown as an example). People begin to rely on these non-standard features and will file bug reports complaining that an implementation does not support them. As the developer of Python-Markdown, I regularly see such reports when the lib does in fact offer support. It just isn't enabled by default. When this is pointed out to them, their reaction is often less that understanding. So no implementation made for general consumption will last long without support for all standard features.

Adding additional complication is that there is not perfect agreement among implementations regarding the standard features. See the Babelmark 2 FAQ for details. In that FAQ you will find a lot of documented differences which are rather nuanced. People really find these minor differences important. For that reason, a group of people created Commonmark, a strict specification for Markdown. However, as Commonmark has never received the blessing of the creator of Markdown, some question whether it can be considered Markdown at all. Additionally, in some places the spec, by its own admition, is in direct violation of the original rules. Regardless, for an implementation to be a Commonmark implementation is must provide a complete solution with all of the documented features of the spec. The reference implementations (in JS and C) are both quite large. In fact, I doubt you could implement Commonmark with an implementation which used simple rexed based replacement like markdown.pl does.

The point is that with all but the most simple implementations, you are getting more than simply a collection of regex substitutions. The exact features differ from implementation to implementation and would require a careful reading of the documentation for each. Regardless, even a "simple" collection of regex substitutions is rather quite complex and lengthy to implement all of the documented features of Markdown. Anything less would not be considered Markdown.

Another consideration is performance. While the regex based parsers are 'good enough' for most general use (running from the command line as the reference implementation was designed for), the more performant implementations (such as marked or the Commonmark reference implementation) produce an AST and use a renderer. A regex based implementation will never come close to matching that in performance, which is important if your web server is converting Markdown to HTML on each request.

like image 58
Waylan Avatar answered Oct 07 '22 17:10

Waylan