
Using pyparsing to parse a word escape-split over multiple lines

I'm trying to use pyparsing to parse words that can be split across multiple lines with a backslash-newline combination ("\\\n" as a Python string literal). Here's what I have so far:

from pyparsing import *

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))

print(multi_line_word.parseString(
'''super\\
cali\\
fragi\\
listic'''))

The output I get is ['super'], while the expected output is ['super', 'cali', 'fragi', 'listic']. Better still would be all of them joined as one word, which I think I can do with multi_line_word.setParseAction(lambda t: ''.join(t)).

I also tried this code in pyparsing helper, but it fails with the error "maximum recursion depth exceeded".

EDIT 2009-11-15: I realized later that pyparsing is generous about whitespace, so the grammar I thought I was writing was much looser than I intended. What we actually want is no whitespace between the portions of the word, the escape, and the EOL character.
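
To illustrate the whitespace behavior described in the edit, here is a minimal sketch. The names `loose`, `strict`, and `parses` are made up for this illustration and are not part of the original code. `NotAny(White())` is one way to forbid the gap, not necessarily the only one.

```python
from pyparsing import Literal, NotAny, ParseException, White, Word, alphas

word = Word(alphas)
backslash = Literal('\\')

# By default, pyparsing elements skip leading whitespace before matching,
# so a space between the word and the backslash is accepted.
loose = word + backslash

# A negative lookahead for whitespace rejects that gap.
strict = word + NotAny(White()) + backslash


def parses(expr, text):
    """Hypothetical helper: True if expr matches at the start of text."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False


print(parses(loose, 'shiny \\'))   # True
print(parses(strict, 'shiny \\'))  # False
print(parses(strict, 'shiny\\'))   # True
```

The lookahead consumes no input, so it only vetoes the match; the tokens returned are unchanged.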

I realized that the little example string above is insufficient as a test case, so I wrote the following unit tests. Code that passes these tests should match what I intuitively think of as an escape-split word, and only an escape-split word. It will not match a basic word that is not escape-split. We can, and I believe should, use a different grammatical construct for that. Keeping the two separate keeps everything tidy.

import unittest
import pyparsing

# Assumes you named your module 'multiline.py'
import multiline

class MultiLineTests(unittest.TestCase):

    def test_continued_ending(self):

        case = '\\\n'
        expected = ['\\', '\n']
        result = multiline.continued_ending.parseString(case).asList()
        self.assertEqual(result, expected)


    def test_continued_ending_space_between_parse_error(self):

        case = '\\ \n'
        self.assertRaises(
            pyparsing.ParseException,
            multiline.continued_ending.parseString,
            case
        )


    def test_split_word(self):

        cases = ('shiny\\', 'shiny\\\n', ' shiny\\')
        expected = ['shiny']
        for case in cases:
            result = multiline.split_word.parseString(case).asList()
            self.assertEqual(result, expected)


    def test_split_word_no_escape_parse_error(self):

        case = 'shiny'
        self.assertRaises(
            pyparsing.ParseException,
            multiline.split_word.parseString,
            case
        )


    def test_split_word_space_parse_error(self):

        cases = ('shiny \\', 'shiny\r\\', 'shiny\t\\', 'shiny\\ ')
        for case in cases:
            self.assertRaises(
                pyparsing.ParseException,
                multiline.split_word.parseString,
                case
            )


    def test_multi_line_word(self):

        cases = (
                'shiny\\',
                'shi\\\nny',
                'sh\\\ni\\\nny\\\n',
                ' shi\\\nny\\',
                'shi\\\nny ',
                'shi\\\nny captain'
        )
        expected = ['shiny']
        for case in cases:
            result = multiline.multi_line_word.parseString(case).asList()
            self.assertEqual(result, expected)


    def test_multi_line_word_spaces_parse_error(self):

        cases = (
                'shi \\\nny',
                'shi\\ \nny',
                'sh\\\n iny',
                'shi\\\n\tny',
        )
        for case in cases:
            self.assertRaises(
                pyparsing.ParseException,
                multiline.multi_line_word.parseString,
                case
            )


if __name__ == '__main__':
    unittest.main()
asked Nov 14 '09 21:11 by gotgenes


1 Answer

After poking around a bit more, I came upon this help thread, which contained this notable bit:

I often see inefficient grammars when someone implements a pyparsing grammar directly from a BNF definition. BNF does not have a concept of "one or more" or "zero or more" or "optional"...

With that, I got the idea to change these two lines:

multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))

to:

multi_line_word = ZeroOrMore(split_word) + word

This got it to output what I was looking for: ['super', 'cali', 'fragi', 'listic'].

Next, I added a parse action that would join these tokens together:

multi_line_word.setParseAction(lambda t: ''.join(t))

This gives a final output of ['supercalifragilistic'].
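
Put together, the change can be sketched as a self-contained script. This assumes Python 3 with pyparsing installed and uses explicit imports in place of the star import; the variable `text` is introduced here only to hold the example string from the question.

```python
from pyparsing import Literal, Suppress, Word, ZeroOrMore, alphas, lineEnd

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)

# Zero or more escape-split pieces, followed by the final plain word.
multi_line_word = ZeroOrMore(split_word) + word

# Join the collected pieces into a single token.
multi_line_word.setParseAction(lambda t: ''.join(t))

# Same content as the triple-quoted example string in the question.
text = 'super\\\ncali\\\nfragi\\\nlistic'
print(multi_line_word.parseString(text).asList())
# ['supercalifragilistic']
```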

The take-home message I learned is that one doesn't simply walk into Mordor.

Just kidding.

The take-home message is that one can't simply translate BNF one-to-one into pyparsing. The iterative types, such as ZeroOrMore and OneOrMore, should be put to use instead.
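
One plausible reading of why the recursive grammar stopped at 'super': `|` builds a first-match alternative, so `word` succeeds right away and the recursive branch is never tried, and `parseString` does not require the whole input to be consumed. The sketch below compares the two orderings. The names `original`, `reordered`, and `text` are hypothetical, and the reordering is an illustration rather than part of the original answer.

```python
from pyparsing import Forward, Literal, Suppress, Word, alphas, lineEnd

continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)

text = 'super\\\ncali\\\nfragi\\\nlistic'

# Ordering from the question: the plain word is tried first and wins.
original = Forward()
original <<= word | (split_word + original)
print(original.parseString(text).asList())
# ['super']

# Hypothetical reordering: try the longer, recursive branch first.
reordered = Forward()
reordered <<= (split_word + reordered) | word
print(reordered.parseString(text).asList())
# ['super', 'cali', 'fragi', 'listic']
```

The iterative form `ZeroOrMore(split_word) + word` gives the same tokens without any recursion, which is the direction the quoted advice points in.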

EDIT 2009-11-25: To handle the stricter test cases, I changed the code to the following:

no_space = NotAny(White(' \t\r'))
# make sure that the EOL immediately follows the escape backslash
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
# make sure that the escape backslash immediately follows the word
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))

This has the benefit of making sure that no space comes between any of the elements (with the exception of newlines after the escaping backslashes).
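
As a quick check of the stricter grammar outside the unit tests, the sketch below runs it on a few of the inputs. It assumes Python 3 with pyparsing installed; `try_parse` is a hypothetical helper added only for this illustration.

```python
from pyparsing import (Literal, NotAny, OneOrMore, Optional, ParseException,
                       Suppress, White, Word, alphas, lineEnd)

no_space = NotAny(White(' \t\r'))
# The EOL must immediately follow the escape backslash.
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
# The escape backslash must immediately follow the word.
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))


def try_parse(text):
    """Hypothetical helper: the token list, or None on a parse error."""
    try:
        return multi_line_word.parseString(text).asList()
    except ParseException:
        return None


print(try_parse('shi\\\nny'))     # ['shiny']
print(try_parse('shi \\\nny'))    # None: space before the backslash
print(try_parse('shi\\ \nny'))    # None: space after the backslash
print(try_parse('shiny'))         # None: not escape-split at all
```

Because the grammar starts with `OneOrMore(split_word ...)`, a plain word with no escape is rejected, which matches the intent stated in the question's edit.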

answered Sep 24 '22 09:09 by gotgenes