Using pyparsing to parse a word escape-split over multiple lines

Question

I'm trying to parse words that can be broken up over multiple lines with a backslash-newline combination (a backslash immediately followed by a newline, "\\\n" as a Python string) using pyparsing. Here's what I have so far:

from pyparsing import *

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))

print(multi_line_word.parseString(
'''super\\
cali\\
fragi\\
listic'''))

The output I get is ['super'], while the expected output is ['super', 'cali', 'fragi', 'listic']. Better still would be all of them joined as one word (which I think I can do with multi_line_word.setParseAction(lambda t: ''.join(t))).

I also tried adapting this code from the pyparsing helper, but it fails with the error "maximum recursion depth exceeded".

EDIT 2009-11-15: I realized later that pyparsing is generous about white space: it skips white space between tokens by default, so the grammar I had written was a lot looser than what I thought I was parsing for. That is to say, we want to see no white space between any of the portions of the word, the escape, and the EOL character.
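To illustrate the looseness described in this edit, here is a minimal sketch that reuses the definitions from the question (with the backslash written as '\\' in Python source). It relies on pyparsing's default behaviour of skipping white space before each token; the sample strings are made up for illustration.

```python
from pyparsing import Literal, Suppress, Word, alphas, lineEnd

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)

# A space between the backslash and the newline is silently skipped.
loose_ending = continued_ending.parseString('\\ \n').asList()
print(loose_ending)

# So is a space between the word and the backslash.
loose_word = split_word.parseString('shiny \\\n').asList()
print(loose_word)
```

Both calls succeed, which is exactly the behaviour the stricter tests are meant to rule out.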

I realized that the little example string above is insufficient as a test case, so I wrote the following unit tests. Code that passes these tests should match what I intuitively think of as an escape-split word, and only an escape-split word. It should not match a basic word that is not escape-split. We can, and I believe should, use a different grammatical construct for that. Keeping the two separate keeps things tidy.

import unittest
import pyparsing

# Assumes you named your module 'multiline.py'
import multiline

class MultiLineTests(unittest.TestCase):

    def test_continued_ending(self):

        case = '\\\n'
        expected = ['\\', '\n']
        result = multiline.continued_ending.parseString(case).asList()
        self.assertEqual(result, expected)


    def test_continued_ending_space_between_parse_error(self):

        case = '\\ \n'
        self.assertRaises(
            pyparsing.ParseException,
            multiline.continued_ending.parseString,
            case
        )


    def test_split_word(self):

        cases = ('shiny\\', 'shiny\\\n', ' shiny\\')
        expected = ['shiny']
        for case in cases:
            result = multiline.split_word.parseString(case).asList()
            self.assertEqual(result, expected)


    def test_split_word_no_escape_parse_error(self):

        case = 'shiny'
        self.assertRaises(
            pyparsing.ParseException,
            multiline.split_word.parseString,
            case
        )


    def test_split_word_space_parse_error(self):

        cases = ('shiny \\', 'shiny\n\\', 'shiny\t\\', 'shiny\\ ')
        for case in cases:
            self.assertRaises(
                pyparsing.ParseException,
                multiline.split_word.parseString,
                case
            )


    def test_multi_line_word(self):

        cases = (
                'shiny\\',
                'shi\\\nny',
                'sh\\\ni\\\nny\\\n',
                ' shi\\\nny\\',
                'shi\\\nny ',
                'shi\\\nny captain'
        )
        expected = ['shiny']
        for case in cases:
            result = multiline.multi_line_word.parseString(case).asList()
            self.assertEqual(result, expected)


    def test_multi_line_word_spaces_parse_error(self):

        cases = (
                'shi \\\nny',
                'shi\\ \nny',
                'sh\\\n iny',
                'shi\\\n\tny',
        )
        for case in cases:
            self.assertRaises(
                pyparsing.ParseException,
                multiline.multi_line_word.parseString,
                case
            )


if __name__ == '__main__':
    unittest.main()

gotgenes · Accepted Answer

After poking around a bit more, I came upon this help thread, which contained this notable bit:

I often see inefficient grammars when someone implements a pyparsing grammar directly from a BNF definition. BNF does not have a concept of "one or more" or "zero or more" or "optional"...

With that, I got the idea to change these two lines

multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))

To

multi_line_word = ZeroOrMore(split_word) + word

This got it to output what I was looking for: ['super', 'cali', 'fragi', 'listic'].

Next, I added a parse action that would join these tokens together:

multi_line_word.setParseAction(lambda t: ''.join(t))

This gives a final output of ['supercalifragilistic'].
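Put together, the pieces above can be sketched as one self-contained script. The grammar is the one from the question with the two lines swapped for the ZeroOrMore version; the variable names `text` and `joined` are only there for the example.

```python
from pyparsing import Literal, Suppress, Word, ZeroOrMore, alphas, lineEnd

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)

# Any number of escape-split pieces, followed by the final plain word.
multi_line_word = ZeroOrMore(split_word) + word
# Join the pieces into a single token.
multi_line_word.setParseAction(lambda t: ''.join(t))

text = 'super\\\ncali\\\nfragi\\\nlistic'
joined = multi_line_word.parseString(text).asList()
print(joined)  # ['supercalifragilistic']
```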

The take home message I learned is that one doesn't simply walk into Mordor.

Just kidding.

The take-home message is that one can't simply implement a one-to-one translation of BNF with pyparsing. The iterative types, such as ZeroOrMore and OneOrMore, should be put to use instead of recursion where they fit.
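As a side note, a likely reason the recursive version stopped at ['super'] is that `|` builds a first-match alternation: the bare `word` alternative succeeds on 'super', and parseString does not insist on consuming the rest of the input. The sketch below is an illustration rather than the approach taken in this answer; it compares the original ordering with the alternatives swapped, and the names `word_first` and `split_first` are made up for the comparison.

```python
from pyparsing import Forward, Literal, Suppress, Word, alphas, lineEnd

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)

# Original ordering: the bare word is tried first and wins immediately.
word_first = Forward()
word_first <<= word | (split_word + word_first)

# Swapped ordering: try the escape-split form first, fall back to a bare word.
split_first = Forward()
split_first <<= (split_word + split_first) | word

text = 'super\\\ncali\\\nfragi\\\nlistic'
first_result = word_first.parseString(text).asList()
swapped_result = split_first.parseString(text).asList()
print(first_result)    # ['super']
print(swapped_result)  # ['super', 'cali', 'fragi', 'listic']
```

The iterative ZeroOrMore form avoids having to think about alternative ordering at all, which is part of why it reads more naturally here.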

EDIT 2009-11-25: To satisfy the stricter test cases, I modified the code to the following:

no_space = NotAny(White(' \t\r'))
# make sure that the EOL immediately follows the escape backslash
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
# make sure that the escape backslash immediately follows the word
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))

This has the benefit of making sure that no space comes between any of the elements (with the exception of newlines after the escaping backslashes).
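A quick way to see the stricter grammar at work is to run it against a few inputs. The sketch below repeats the grammar so that it is self-contained. It assumes the white space set in no_space is space, tab, and carriage return, so that the newline itself is still left for lineEnd to match. `try_parse` is a hypothetical helper written for this example, and the sample strings are borrowed from the unit tests in the question.

```python
from pyparsing import (Literal, NotAny, OneOrMore, Optional, ParseException,
                       Suppress, White, Word, alphas, lineEnd)

# Assumption: space, tab and carriage return are excluded; the newline is not.
no_space = NotAny(White(' \t\r'))
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))


def try_parse(text):
    """Hypothetical helper: return the token list, or None on a parse error."""
    try:
        return multi_line_word.parseString(text).asList()
    except ParseException:
        return None


samples = [
    'shi\\\nny',      # escape-split word: accepted
    'sh\\\ni\\\nny',  # split twice: accepted
    'shi \\\nny',     # space before the backslash: rejected
    'shi\\ \nny',     # space after the backslash: rejected
    'shiny',          # not escape-split at all: rejected
]
results = [try_parse(s) for s in samples]
for sample, result in zip(samples, results):
    print(repr(sample), '->', result)
```

The first two samples come back as ['shiny']; the other three give None, since a plain word with no escape is deliberately not matched by this grammar.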
