
using regex to fix markdown input - link labels

We've run into a problem with some Markdown content. A few jQuery editors we used did not write proper Markdown syntax. Embedded links used the 'label' format, which puts the link definitions at the bottom of the document (just like the StackOverflow editor). The problem is that the definitions were sometimes formatted in a non-standard way: they may be prefixed with 0 to 3 spaces, but some came in with 4 spaces (you might notice that StackOverflow's JavaScript forces 2 spaces), which makes Markdown parsers treat them as preformatted text.

As a quick example:

This is a sample document that would have inline links.
[Example 0][0], [Example 1][1], [Example 2][2] , [Example 3][3] , [Example 4][4]

[0]: http://example.com
 [1]:      http://example.com/1
  [2] : http://example.com/2
   [3]: http://example.com/3
    [4]  : http://example.com/4

I want to reformat this last section into proper Markdown:

[0]: http://example.com
[1]: http://example.com/1
[2]: http://example.com/2
[3]: http://example.com/3
[4]: http://example.com/4

I'm running into a wall trying to come up with the right regex to catch the 'labels' section. I can grab the labels within the section fine -- but the section is eluding me.

Here's what I have so far:

import re

RE_footnote = re.compile(r"""
    (?P<labels_section>
        ^[\t\ ]*$                             ## we must start with an empty line
        \s+                       
        (?P<labels>
            (?P<a_label>
                ^
                    [\ \t]*                     ## we could have 0-n spaces or tabs
                    \[                          ## BRACKET - open
                        (?P<id>
                            [^^\]]+
                        )
                    \]                          ## BRACKET - close
                    \s*
                    :                           ## COLON
                    \s*
                    (?P<link>                   ## WE want anything here
                        [^$]+
                    )
                $
            )+                                  ## multiple labels
        )
    )
""",re.VERBOSE|re.I|re.M)

The specific problems I have:

  1. I can't figure out how to allow for 1 or more "blank lines". This triggers an invalid regex error, "nothing to repeat":

    (?:                             ## wrap it in a non-capturing group, require 1+ occurrences
        ^[\t\ ]*$
    )+

  2. The match won't work without the whitespace match (\s+) before the labels group. I can't figure out why.

  3. I want this to match at the END of the document only, to ensure we're only fixing these JavaScript errors (and not something in the core of the document). All my attempts to work a \z into this have failed miserably.

Can anyone offer some advice?
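
For reference, problem 2 can be reproduced in isolation. This is a minimal sketch, assuming Python 3's `re` module; the sample text and variable names are made up for illustration. In MULTILINE mode `$` matches just before a newline without consuming it, so something like `\s+` has to consume that newline before the next `^` can match on a later line. The last check shows that `[^$]+` does not mean "up to the end of the line": inside a character class `$` is a literal dollar sign, so the class also matches newlines.

```python
import re

# Hypothetical sample: an intro line, a blank line, then a label line.
sample = "intro\n\n[0]: http://example.com"

# '$' matches before the blank line's newline but does not consume it,
# so the following '^\[' is tried against that newline and fails.
without_ws = re.search(r"^[ \t]*$^\[", sample, re.MULTILINE)

# '\s+' consumes the newline, so '^' can match at the start of the label line.
with_ws = re.search(r"^[ \t]*$\s+^\[", sample, re.MULTILINE)

# '[^$]' means "any character except a literal dollar sign", newlines included.
greedy = re.match(r"[^$]+", "a\nb$c").group(0)

print(without_ws is None, with_ws is not None, repr(greedy))
```

Here `without_ws` is `None`, `with_ws` is a match object, and `greedy` spans both lines up to the dollar sign.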


Update

This works:

RE_MARKDOWN_footnote = re.compile(r"""
    (?P<labels_section>
        (?:                            ## we must start with an empty / whitespace-only line
            ^\s*$
        )                              
        \s*                             ## there can be more whitespace lines
        (?P<labels>
            (?P<a_label>
                ^
                    [\ \t]*                     ## we could have 0-n spaces or tabs
                    \[                          ## BRACKET - open
                        (?P<id>
                            [^^\]]+
                        )
                    \]                          ## BRACKET - close
                    \s*
                    :                           ## COLON
                    \s*
                    (?P<link>                   ## WE want anything here
                        [^$]+
                    )
                $
            )+                                  ## multiple labels
        )
        \s*                                     ## we might have some empty lines 
        \Z                                      ## ensure the end of document
    )
""",re.VERBOSE|re.I|re.M)
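
One way to apply the same idea (find the label block anchored with `\Z`, then rewrite each line in it) is sketched below. This is not the pattern above but a simplified two-step variant, and the names `RE_LABEL_LINE`, `RE_TRAILING_LABELS` and `normalize_trailing_labels` are hypothetical. It assumes the label lines are consecutive, with no blank lines between them.

```python
import re

# One label definition per line, e.g. "   [3] : http://example.com/3".
RE_LABEL_LINE = re.compile(
    r"^[ \t]*\[([^\]\n]+)\][ \t]*:[ \t]*(\S.*?)[ \t]*$",
    re.MULTILINE,
)

# A run of consecutive label lines followed only by whitespace and the end of the string.
RE_TRAILING_LABELS = re.compile(
    r"""
    (?: ^[ \t]* \[ [^\]\n]+ \] [ \t]* : [ \t]* \S [^\n]* \n? )+   # label lines
    \s* \Z                                                        # nothing else until the end
    """,
    re.VERBOSE | re.MULTILINE,
)


def normalize_trailing_labels(text):
    """Rewrite the label block at the very end of `text`; leave everything else untouched."""
    match = RE_TRAILING_LABELS.search(text)
    if match is None:
        return text
    # Strip indentation and stray spaces around the colon on each label line.
    fixed = RE_LABEL_LINE.sub(r"[\1]: \2", match.group(0))
    return text[: match.start()] + fixed.rstrip() + "\n"
```

Because the block must run up to `\Z`, an indented label-like line in the middle of the document is left alone, which keeps genuine preformatted text intact.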
Jonathan Vanasco asked Nov 10 '22 13:11


1 Answer

I just started from scratch; is there a reason something simpler like this couldn't work?

^\s*                # beginning of the line; may include whitespace
  \[                # opening bracket
     (?P<id>\d+)    # our ID
  \]                # closing bracket
\s*                 # optional whitespace
  :                 # colon
\s*                 # optional whitespace
  (?P<link>[^\n]+)  # our link is everything up to a new line
$                   # end of the line

This was done using the global and multi-line modifiers, gm (plus the extended modifier, x, if you keep the whitespace and comments shown above). Replace matches with: [\id]: \link. Here is a working example: http://regex101.com/r/mM8dI2
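
In Python, the same approach might look like the sketch below. Two things differ from the pattern above and are assumptions on my part: named groups are referenced in the replacement as `\g<id>` and `\g<link>`, and `[ \t]*` is used instead of `\s*`, because `\s` also matches newlines and would swallow a blank line sitting in front of a label. The names `RE_LABEL` and `fix_labels` are hypothetical.

```python
import re

RE_LABEL = re.compile(
    r"""
    ^[ \t]*              # leading indentation: spaces or tabs only, so blank lines survive
    \[(?P<id>\d+)\]      # numeric label in brackets
    [ \t]*:[ \t]*        # colon with optional whitespace on either side
    (?P<link>[^\n]+)     # the link is everything up to the newline
    $                    # end of the line
    """,
    re.VERBOSE | re.MULTILINE,
)


def fix_labels(text):
    """Normalize every numeric label line in `text` to the '[id]: link' form."""
    return RE_LABEL.sub(r"[\g<id>]: \g<link>", text)


if __name__ == "__main__":
    print(fix_labels("text\n\n [1]:      http://example.com/1\n    [4]  : http://example.com/4\n"))
```

Note that this rewrites matching lines anywhere in the document, not only at the end, so it does not cover the "END of the document only" requirement from the question.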

Sam answered Nov 14 '22 21:11