I am trying to extract a code block from a Markdown document using PCRE RegEx. For the uninitiated, a code block in Markdown is defined thus:
To produce a code block in Markdown, simply indent every line of the block by at least 4 spaces or 1 tab. A code block continues until it reaches a line that is not indented (or the end of the article).
So, given this text:
This is a code block:
    I need capturing along with
    this line
This is a code fence below (to be ignored):
``` json
This must have three backticks
flanking it
```
I love `inline code` too but don't capture
and one more short code block:
   Capture me
So far I have this RegEx:
(?:[ ]{4,}|\t{1,})(.+)
But it simply captures each line prefixed with at least four spaces or one tab. It doesn't capture the whole block.
What I need help with is how to set the condition to capture everything after 4 spaces or 1 tab until you either get to a line that is not indented or the end of the text.
Here's an online work in progress:
https://www.regex101.com/r/yMQCIG/5
You should use the anchors ^ and $ in combination with the m modifier, so that they match at the start and end of each line rather than only at the start and end of the whole string. Also, your test text had only 3 leading spaces in the final block:
^((?:(?:[ ]{4}|\t).*(\R|$))+)
With \R and the repetition you match one whole block with each single match, instead of a line per match.
See demo on regex101
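For anyone who wants to try the same idea outside regex101, here is a minimal sketch in Python. It is an adaptation, not the PCRE pattern itself: Python's `re` module has no `\R`, so `(?:\r?\n|\Z)` is swapped in for `(\R|$)`. The names `BLOCK` and `find_code_blocks` are made up for this example, and the sample text uses four leading spaces in the last block.

```python
import re

# Adapted from the PCRE pattern above. Python's re has no \R,
# so (?:\r?\n|\Z) stands in for (\R|$). re.MULTILINE plays the
# role of the m modifier, letting ^ match at every line start.
BLOCK = re.compile(r"^((?:(?:[ ]{4}|\t).*(?:\r?\n|\Z))+)", re.MULTILINE)


def find_code_blocks(text):
    """Return each run of consecutive indented lines as one string."""
    return [m.group(1) for m in BLOCK.finditer(text)]


text = (
    "This is a code block:\n"
    "\n"
    "    I need capturing along with\n"
    "    this line\n"
    "\n"
    "This is a code fence below (to be ignored):\n"
    "\n"
    "``` json\n"
    "This must have three backticks\n"
    "flanking it\n"
    "```\n"
    "\n"
    "I love `inline code` too but don't capture\n"
    "and one more short code block:\n"
    "\n"
    "    Capture me\n"
)

blocks = find_code_blocks(text)
for block in blocks:
    print(repr(block))
```

Each match still contains the leading indentation and the trailing line break, so you would strip those in a second step if you only want the code itself.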
Disclaimer: The rules of markdown are more complicated than the presented example text shows. For instance, when (nested) lists have code blocks in them, these need to be prefixed with 8, 12 or more spaces. Regular expressions are not suitable to identify such code blocks, or other code blocks embedded in markdown notation that uses the wider range of format combinations.
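To illustrate that limitation, here is a small Python sketch using a hypothetical list sample. It assumes the original Markdown convention that a continuation paragraph inside a list item is indented by 4 spaces and a code block inside the item by 8. The pattern is the same adaptation for Python's `re` (no `\R`, so `(?:\r?\n|\Z)` is used instead), and the name `BLOCK` is made up for this example.

```python
import re

# Same idea as the pattern above, adapted for Python's re module.
BLOCK = re.compile(r"^((?:(?:[ ]{4}|\t).*(?:\r?\n|\Z))+)", re.MULTILINE)

# Hypothetical sample: a list item with a second paragraph (prose)
# and a nested code block indented by 8 spaces.
sample = (
    "1.  A list item\n"
    "\n"
    "    Second paragraph of the same item (prose, not code)\n"
    "\n"
    "        real_code()\n"
)

matches = [m.group(1) for m in BLOCK.finditer(sample)]
for match in matches:
    print(repr(match))
```

The pattern reports two blocks here, and the first one is ordinary list prose rather than code, which is the kind of false positive a single regular expression cannot rule out.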