Regex for Markdown Table Syntax?

I'm currently developing a little tool that converts GitHub wikis to GitHub Pages. I'm now trying to add proper support for Markdown tables, which the parser I'm using doesn't support.

I hook into the parser's lexer, extend it with various GitHub-wiki-specific tweaks (e.g. links), and then pass the modified tokens back to the parser. Tables should fit this scheme as well. My tweaks use various regex patterns and regex replacements to perform the modifications I need.

I'm a bit stuck with the complicated table syntax, though. You can find an example of that here and here. As you can see there's some structure but some parts are entirely optional.

I've given it some thought, and I think I would like a regex that yields the header (first line), the column alignment data (second line), and the actual content as separate groups. It should require at least one content line in order to match. The header and alignment data also have to obey certain rules, as seen in the examples.

How would you approach building a regex such as this? Better yet, can someone provide a starting point to build upon? It's possible my approach is misguided (perhaps regex can be avoided?). If so, any ideas that lead to the same result more easily are appreciated.

Juho Vepsäläinen asked Mar 23 '12 10:03


4 Answers

I need a regex solution to the same problem. Here's what I've got so far, will update it as I am able to improve it:

\|(?:([^\r\n|]*)\|)+\r?\n\|(?:(:?-+:?)\|)+\r?\n(\|(?:([^\r\n|]*)\|)+\r?\n)+

Tested with JavaScript.
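A minimal sketch of how this pattern behaves, written in Python rather than JavaScript and assuming the leading pipe is escaped as `\|`. The sample table and variable names are illustrative only. One thing it makes visible: a capturing group inside a repeated group keeps only its last repetition, so the cell groups hold just the final cell of each line, and the individual cells still have to be split out of the overall match.

```python
import re

# The pattern above, with the leading pipe escaped as \| (unescaped, it
# would act as an empty alternative that matches everywhere).
TABLE_RE = re.compile(
    r"\|(?:([^\r\n|]*)\|)+\r?\n"        # header line
    r"\|(?:(:?-+:?)\|)+\r?\n"           # alignment line
    r"(\|(?:([^\r\n|]*)\|)+\r?\n)+"     # one or more body lines
)

# Illustrative input, not taken from the original post.
sample = (
    "| Tables | Are | Cool |\n"
    "|:------:|:---:|-----:|\n"
    "| a | b | c |\n"
    "| d | e | f |\n"
)

m = TABLE_RE.search(sample)
assert m is not None

# Repeated capturing groups only remember their last iteration.
last_header_cell = m.group(1)   # last cell of the header line
last_align_cell = m.group(2)    # last cell of the alignment line
last_body_line = m.group(3)     # last body line

# To get every cell, split the full match into lines and then on pipes.
lines = m.group(0).splitlines()
cells = [[c.strip() for c in line.strip("|").split("|")] for line in lines]
print(cells)
```

In other words, the pattern is mainly useful for locating a table; extracting the cells takes a second pass over the matched text.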

Sean answered Oct 17 '22 21:10


I had the same problem and, never having found a suitable answer, eventually came up with the following.

^(\|[^\n]+\|\r?\n)((?:\|:?[-]+:?)+\|)(\n(?:\|[^\n]+\|\r?\n?)*)?$

The flags are "Global" and "Multiline".

Although it's not really based on Sean's answer, it ended up rather similar, with a few notable differences: it is a little shorter, completes in fewer steps (59 vs. 126, according to regex101.com), and has arguably more "sensible" capturing groups. It also allows "incomplete" tables, i.e. tables with no "body". (The reason I'm adding it as a separate answer is that I really do find it more useful, plus my ego would not allow me to do otherwise.) ;)

In a nutshell:

  • It will only allow for "strict" markdown tables, where every line starts and ends with a | character, and the "cell alignments" line is properly formatted.
  • First group captures the "head", the second group the "cell alignments" line, and the (optional) third group captures the "body".
  • It needs at least one completed and correctly formatted "alignment" cell to consider it a table, but will match incomplete tables otherwise (i.e. with no "body").
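The three groups described above can be put to work roughly as follows. This is a sketch in Python (the answer itself was tested in Java), and the helper names `split_row`, `alignment` and `parse_tables` are hypothetical, introduced only for illustration.

```python
import re

# The pattern from this answer, compiled with the multiline flag.
# Python has no "global" flag; finditer plays that role.
TABLE_RE = re.compile(
    r"^(\|[^\n]+\|\r?\n)((?:\|:?[-]+:?)+\|)(\n(?:\|[^\n]+\|\r?\n?)*)?$",
    re.MULTILINE,
)

def split_row(line):
    """Split '| a | b |' into ['a', 'b']."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]

def alignment(cell):
    """Map an alignment cell such as ':---:' to a name."""
    left, right = cell.startswith(":"), cell.endswith(":")
    if left and right:
        return "center"
    if right:
        return "right"
    if left:
        return "left"
    return "default"

def parse_tables(text):
    """Yield (header, alignments, rows) for each table found in text."""
    for m in TABLE_RE.finditer(text):
        # Group 3 is None when the table has no body.
        head, align, body = m.group(1), m.group(2), m.group(3) or ""
        yield (
            split_row(head),
            [alignment(c) for c in split_row(align)],
            [split_row(line) for line in body.splitlines() if line.strip()],
        )

# Illustrative input, not taken from the original post.
sample = (
    "| Tables | Are | Cool |\n"
    "|:------:|:---:|-----:|\n"
    "| a | b | c |\n"
    "| d | e | f |\n"
)
tables = list(parse_tables(sample))
print(tables)
```

The regex only finds the table and separates its three parts; the per-cell splitting is done in plain string code, because a single regex cannot return a variable number of cells as separate groups.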

Tested in Java (Android), on Regex101, and on Debuggex.

Hope it helps someone. :)

Attila Orosz answered Oct 17 '22 21:10


Something that I did:

  1. Regex for parsing the table header and delimiter:

[|]?(\s+[A-Za-z0-9 -_*#@$%:;?!.,\/\\]+\s+)[|]?[|]?(\s+[A-Za-z0-9 -_*#@$%:;?!.,\/\\]+\s+)[|]?[|]?(\s+[A-Za-z0-9 -_*#@$%:;?!.,\/\\]+\s+)[|]?\r?\n?\|?:-+:\|?:-+:\|?:-+:\|?
Modifier: global

  2. Regex for the delimiter between header and text:

\|?:-+:\|?:-+:\|?:-+:\|?\r?\n?
Modifier: global

  3. Regex for parsing all rows before and after the header delimiter:

[|]?(\s+[A-Za-z0-9 -_*#@$%:;?!.,\/\\]+\s+)[|]?[|]?(\s+[A-Za-z0-9 -_*#@$%:;?!.,\/\\]+\s+)[|]?[|]?(\s+[A-Za-z0-9 -_*#@$%:;?!.,\/\\]+\s+)[|]?\r?\n?

Modifiers: global, multiline

This is the table used for parsing:

| Tables        | Are           | Cool  |
|:-------------:|:-------------:|:-----:|
| col 3 is      | r-l           | $1600 |
| col 2 is      | centered      | $12   |
| zebra stripes | are neat      | $1    |
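A rough Python sketch of how the row and delimiter patterns from this answer behave on that table, written out one row per line. The names `CELL`, `ROW_RE` and `DELIM_RE` are made up for the example. Note that both patterns are hard-coded to exactly three columns, and the delimiter pattern only accepts centered (`:---:`) columns.

```python
import re

# One cell as written in the answer; the row pattern is three of these.
# Inside the character class, " -_" is a range (space through underscore),
# which already covers digits, capitals and most punctuation, but not "|".
CELL = r"[|]?(\s+[A-Za-z0-9 -_*#@$%:;?!.,\/\\]+\s+)[|]?"
ROW_RE = re.compile(CELL * 3 + r"\r?\n?")
DELIM_RE = re.compile(r"\|?:-+:\|?:-+:\|?:-+:\|?\r?\n?")

table = (
    "| Tables        | Are           | Cool  |\n"
    "|:-------------:|:-------------:|:-----:|\n"
    "| col 3 is      | r-l           | $1600 |\n"
    "| col 2 is      | centered      | $12   |\n"
    "| zebra stripes | are neat      | $1    |\n"
)

# findall stands in for the "global" modifier; the patterns contain no
# ^ or $ anchors, so the multiline flag would make no difference here.
rows = [tuple(c.strip() for c in row) for row in ROW_RE.findall(table)]
has_delimiter = DELIM_RE.search(table) is not None

for row in rows:
    print(row)
```

The delimiter line is skipped by the row pattern because each cell has to start with whitespace, so `rows` holds the header followed by the three body rows.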

ryodeushii answered Oct 17 '22 21:10


I ended up skipping regex altogether and just hacked it together using conventional logic. It might not be as pretty or short as a regex-based solution, but at least I can maintain it easily.

I did find some Regexes that might have fit this purpose btw. See MultiMarkdown.
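For readers wondering what a regex-free version might look like, here is one possible line-by-line sketch in Python. It is not the code actually used in the tool; the function names are hypothetical, and it assumes the strict form from the question (every line starts and ends with `|`, and at least one content line is required).

```python
def split_row(line):
    """Split a '| a | b |' line into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]

def is_row(line):
    """True if the line starts and ends with a pipe."""
    s = line.strip()
    return len(s) >= 2 and s.startswith("|") and s.endswith("|")

def is_alignment_cell(cell):
    """True for cells such as '---', ':---', '---:' or ':---:'."""
    core = cell[1:] if cell.startswith(":") else cell
    core = core[:-1] if core.endswith(":") else core
    return set(core) == {"-"}

def parse_table(lines):
    """Return (header, alignment_cells, rows), or None if lines do not
    start with a header line, an alignment line and one content line."""
    if len(lines) < 3 or not all(is_row(line) for line in lines[:3]):
        return None
    header = split_row(lines[0])
    align = split_row(lines[1])
    if len(align) != len(header):
        return None
    if not all(is_alignment_cell(c) for c in align):
        return None
    rows = []
    for line in lines[2:]:
        if not is_row(line):
            break  # the table ends at the first non-table line
        rows.append(split_row(line))
    return header, align, rows

# Illustrative input, not taken from the original post.
sample = [
    "| Tables | Are | Cool |",
    "|:------:|:---:|-----:|",
    "| a | b | c |",
    "| d | e | f |",
    "Some text after the table.",
]
result = parse_table(sample)
print(result)
```

Each rule (pipes at both ends, matching column counts, valid alignment cells) is a separate check, which is what makes this style easier to maintain than a single long pattern.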

Juho Vepsäläinen answered Oct 17 '22 19:10