
Pyparsing: nested Markdown emphasis

I'm noodling around with some simple Markdown text to play with and learn Pyparsing and grammars in general. I've run into a problem almost immediately that I'm having trouble solving. I'm trying to parse a simple version of the CommonMark spec for emphasis. In this setup, nested emphasis is allowed, so that

*foo *bar* baz*

should give:

<em>foo <em>bar</em> baz</em>

I've tried using a recursive definition to match this, but it's not working. Here's some sample code:

from pyparsing import *

text = Word(printables,excludeChars="*")
enclosed = Forward()
emphasis = QuotedString("*").setParseAction(lambda x: "<em>%s</em>" % x[0],contents=enclosed)
enclosed << emphasis | text

test = """
*foo *bar* bar*
"""

print emphasis.transformString(test)

But what I get back from this is:

<em>foo </em>bar<em> bar</em>

Forgive my noobishness; can someone point me in the right direction?

Edit:

In response to abarnert's great probing question, I'll clarify. I'm just playing around, so I can use an arbitrarily restricted form of the notation. I'll assume that only single '*'s occur, and that they never occur next to each other. That leaves whitespace to disambiguate: a * not followed by whitespace opens emphasis, and a * not preceded by whitespace closes it.

Even with that, I'm not sure how to proceed with Pyparsing. Some sort of stack-based approach, pushing opening * and popping them when they validate as closing? How would one do that with Pyparsing? Or is there a more efficient approach?

Winawer asked Jun 25 '26 05:06


1 Answer

With those additional rules, I don't think you need to worry about the recursion at all. Just handle the opening and closing emphasis expressions as they are found, whether they match up or not:

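from pyparsing import FollowedBy, LineEnd, LineStart, Suppress, White

# An opening '*' comes at the start of a line or right after whitespace.
# Keep any matched whitespace and swap the '*' for '<em>'.
openEmphasis = (LineStart() | White()) + Suppress('*')
openEmphasis.setParseAction(lambda x: ''.join(x.asList() + ['<em>']))

# A closing '*' is followed by whitespace or the end of a line.
closeEmphasis = '*' + FollowedBy(White() | LineEnd())
closeEmphasis.setParseAction(lambda x: '</em>')

# Don't skip leading whitespace, so the rules above can see it.
emphasis = (openEmphasis | closeEmphasis).leaveWhitespace()

test = """
*foo *bar* bar*
"""
print(test)
print(emphasis.transformString(test))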

Prints:

*foo *bar* bar*

<em>foo <em>bar</em> bar</em>
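
On the stack idea raised in the question: here is a rough sketch of what that could look like without pyparsing at all, using only a character scan. It is not the approach above and not a pyparsing feature; `render_emphasis` is a hypothetical helper name, and it assumes the restricted notation from the question (single '*'s only, with whitespace deciding open versus close). Unmatched asterisks are left as literal text, which is my own assumption about the desired behaviour.

```python
def render_emphasis(text):
    """Convert *...* to <em>...</em> under the question's restricted rules.

    Hypothetical helper for illustration only; not part of pyparsing.
    """
    out = []          # output pieces, joined at the end
    open_stack = []   # positions in `out` of <em> tags not yet closed
    for i, ch in enumerate(text):
        if ch != "*":
            out.append(ch)
            continue
        # Treat the string boundaries as whitespace.
        prev_ch = text[i - 1] if i > 0 else " "
        next_ch = text[i + 1] if i + 1 < len(text) else " "
        if not next_ch.isspace():
            # '*' not followed by whitespace: push an opener.
            open_stack.append(len(out))
            out.append("<em>")
        elif not prev_ch.isspace() and open_stack:
            # '*' not preceded by whitespace: pop and close the innermost opener.
            open_stack.pop()
            out.append("</em>")
        else:
            # Surrounded by whitespace, or nothing to close: keep it literal.
            out.append(ch)
    # Assumption: openers that never got closed go back to a literal '*'.
    for idx in open_stack:
        out[idx] = "*"
    return "".join(out)


print(render_emphasis("*foo *bar* bar*"))
```

Unlike the `transformString` version, this one only emits a closing tag when there is an opener to match, so the tags it produces are always balanced. The cost is that the rules live in hand-written code rather than in a grammar.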

You are not the first to trip over this kind of application. When I presented at PyCon'06, an eager attendee dove right in to parse out some markdown, with an input string something like "****a** b**** c**". We worked on it a bit together, but the disambiguation rules were just too context-aware for a basic pyparsing parser to handle.

PaulMcG answered Jun 27 '26 18:06


