I would like to create a grammar for a file that contains several sections (like PARAGRAPH below).
A section starts with its keyword (e.g. PARAGRAPH), followed by a header (here, a title), and has its contents on the following lines; each line of content is one row of the section. In effect, a section is like a table with a header, columns and rows.
In the example below (tablefile), I limit each section to one column and one row.
    tablefile := paragraph*
    paragraph := PARAGRAPH title CR TAB content
    title, content := \w+
As I need line breaks and tabulation to be handled explicitly, I need to set the default whitespace characters to ' '.

    from pyparsing import LineEnd, OneOrMore, ParserElement, White, Word, alphas

    def grammar():
        ''' Bottom-up grammar definition '''
        # Only spaces are skipped implicitly; tabs and line breaks stay significant
        ParserElement.setDefaultWhitespaceChars(' ')
        TAB = White("\t").suppress()
        CR = LineEnd().setName("Carriage Return").suppress()
        PARAGRAPH = 'PARAGRAPH'
        title = Word(alphas)
        content = Word(alphas)
        paragraph = (PARAGRAPH + title + CR + TAB + content)
        tablefile = OneOrMore(paragraph)
        # Keep tabs in the input instead of expanding them to spaces
        tablefile.parseWithTabs()
        return tablefile
This dummy example matches easily:

    PARAGRAPH someTitle
        thisIsContent

This other one, less so:

    PARAGRAPH someTitle
        thisIsContent
    PARAGRAPH otherTitle
        thisIsOtherContent
The parser expects PARAGRAPH right after the first content and stumbles upon a line break instead (remember setDefaultWhitespaceChars(' ')). Am I compelled to add an optional CR (CR?) at the end of a paragraph? What would be a better way to ignore such trailing line breaks?
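One possible answer, sketched under the assumption that each paragraph may simply consume its own trailing line breaks: because LineEnd also matches at the end of the input, ending paragraph with OneOrMore(CR) accepts both a line break between paragraphs and a last paragraph without a trailing newline. The function name grammar_with_trailing_cr is hypothetical; the other names are reused from the grammar above.

```python
from pyparsing import LineEnd, OneOrMore, ParserElement, White, Word, alphas


def grammar_with_trailing_cr():
    """Variant of grammar() in which a paragraph consumes its trailing line breaks."""
    # Must be called before the elements below are created
    ParserElement.setDefaultWhitespaceChars(' ')
    TAB = White('\t').suppress()
    CR = LineEnd().suppress()
    PARAGRAPH = 'PARAGRAPH'
    title = Word(alphas)
    content = Word(alphas)
    # OneOrMore(CR) absorbs the line break after the content,
    # plus any completely empty lines that follow it
    paragraph = PARAGRAPH + title + CR + TAB + content + OneOrMore(CR)
    tablefile = OneOrMore(paragraph)
    tablefile.parseWithTabs()
    return tablefile


text = ("PARAGRAPH someTitle\n"
        "\tthisIsContent\n"
        "PARAGRAPH otherTitle\n"
        "\tthisIsOtherContent")
tokens = grammar_with_trailing_cr().parseString(text, parseAll=True).asList()
print(tokens)
```

This only handles empty lines; lines that contain tabs are still significant here, because the tab is not a default whitespace character in this variant.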
Also, I would like to allow tabs and spaces anywhere in the file without disturbing the parse. The only required behavior is that a paragraph's content starts with TAB and that PARAGRAPH starts its line. That also means skipping blank lines (empty, or containing only tabs and spaces) within and between paragraphs.
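To make these requirements concrete, here is a line-based reference in plain Python of what the grammar is meant to accept. It is not pyparsing and not the grammar itself, only a sketch of the intended behavior; the function name parse_tablefile and the dictionary layout are assumptions made for illustration.

```python
def parse_tablefile(text):
    """Return a list of sections, each with a title and its rows of columns."""
    sections = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # Empty line, or a line holding only spaces and tabs: skip it
            continue
        if line[0] == '\t':
            # A content row must start with TAB and belong to a section
            if not sections:
                raise ValueError('content before any PARAGRAPH: %r' % line)
            sections[-1]['rows'].append(fields)
        elif line.startswith('PARAGRAPH') and fields[0] == 'PARAGRAPH' and len(fields) == 2:
            # PARAGRAPH must start the line and be followed by one title
            sections.append({'title': fields[1], 'rows': []})
        else:
            raise ValueError('unexpected line: %r' % line)
    return sections


text = ("PARAGRAPH  someTitle \n"
        "\tthisIsContent\n"
        " \t \n"
        "\n"
        "PARAGRAPH otherTitle\n"
        "\t thisIsOtherContent\n")
sections = parse_tablefile(text)
print(sections)
```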
So I added this line:

    tablefile.ignore(LineStart() + ZeroOrMore(White(' \t')) + LineEnd())
But every requirement I just described seems to conflict with my need to set the default whitespace to ' ', and puts me at a dead end.
Indeed, this would cause everything to break down:

    tablefile.ignore(CR)
    tablefile.ignore(TAB)

If I want \t to be ignored anywhere in the text except at the start of lines, I have to add it to the default whitespace characters.
Thus, I found a way to forbid any whitespace character at the start of a line: the leaveWhitespace method. It disables the skipping of whitespace before a token is matched, so the token has to match exactly at the current position. Hence, I can glue some tokens to the start of a line.

    from pyparsing import (Keyword, LineEnd, LineStart, Literal,
                           ParserElement, Word, alphanums)

    ParserElement.setDefaultWhitespaceChars('\t ')
    SOL = LineStart().suppress()
    EOL = LineEnd().suppress()
    title = Word(alphanums)    # character set assumed; Word() needs one
    content = Word(alphanums)  # character set assumed; Word() needs one
    PARAGRAPH = Keyword('PARAGRAPH').leaveWhitespace()
    TAB = Literal('\t').leaveWhitespace()
    paragraph = (SOL + PARAGRAPH + title + EOL + SOL + TAB + content + EOL)

With this solution, I solved my problem of allowing TABs anywhere in the text.
I reached PaulMcGuire's solution (delimitedList) after a bit of thinking, and I encountered an issue with it.
Indeed, here are two different ways of declaring line break separators between two paragraphs. In my opinion, they should be equivalent. In practice, they are not.
Test input (don't forget to replace the leading spaces with tabs if you run it):

    PARAGRAPH titleone
        content1
    PARAGRAPH titletwo
        content2

Common part of the two examples:

    ParserElement.setDefaultWhitespaceChars('\t ')
    SOL = LineStart().suppress()
    EOL = LineEnd().suppress()
    title = Word(alphanums)    # character set assumed; Word() needs one
    content = Word(alphanums)  # character set assumed; Word() needs one
    PARAGRAPH = Keyword('PARAGRAPH').leaveWhitespace()
    TAB = Literal('\t').leaveWhitespace()
First example, which works:

    paragraph = (SOL + PARAGRAPH + title + EOL + SOL + TAB + content + EOL)
    tablefile = ZeroOrMore(paragraph)

Second example, which does not work:

    paragraph = (SOL + PARAGRAPH + title + EOL + SOL + TAB + content)
    tablefile = delimitedList(paragraph, delim=EOL)

Shouldn't they be equivalent? The second raises an exception:
Expected end of text (at char 66), (line:4, col:1)
It is not a big issue for me, as I can fall back to putting EOL at the end of every paragraph-like section of my grammar. But I wanted to highlight this point.
Another requirement I had was to ignore blank lines containing whitespace (' \t').
A simple grammar for this would be:

    ParserElement.setDefaultWhitespaceChars(' \t')
    SOL = LineStart().suppress()
    EOL = LineEnd().suppress()
    word = Word('a')
    entry = SOL + word + EOL
    grammar = ZeroOrMore(entry)
    grammar.ignore(SOL + EOL)
In the end, the file can contain one word per line, with any whitespace anywhere, and blank lines should be ignored.
Happily, truly empty lines are ignored. But the ignore expression is not affected by the default whitespace declaration: a blank line containing spaces or tabs causes the parser to raise a parsing exception.
This behavior is absolutely not the one I expected. Is it the specified one? Is there a bug behind this simple attempt?
I can see in this thread ("Any python module for customized BNF parser?") that PaulMcGuire did not try to ignore blank lines but tokenized them instead, in a makefile-like grammar parser (NL = LineEnd().suppress()):
makefile_parser = ZeroOrMore( symbol_assignment | task_definition | NL )
The only solution I have for now is to preprocess the file and remove the whitespace contained in blank lines, as pyparsing correctly ignores blank lines with no whitespace in them.

    import io

    # In-memory buffer; os.tmpfile() is not available in Python 3
    preprocessed_file = io.StringIO()
    with open(filename, 'r') as source:
        for line in source:
            # Use rstrip (not strip) to preserve the leading TAB of a paragraph content line
            preprocessed_file.write(line.rstrip() + '\n')
    preprocessed_file.seek(0)
    grammar.parseFile(preprocessed_file, parseAll=True)
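As a possible alternative to preprocessing, the tokenizing approach from that thread could be applied to the simple one-word-per-line grammar. This is a sketch, not code from the thread: a bare EOL is offered as an alternative to entry, so the default whitespace (' \t') is skipped before the line end is matched. LineStart is deliberately left out here, which is an assumption made to keep the sketch independent of how LineStart interacts with skipped whitespace.

```python
from pyparsing import LineEnd, ParserElement, Word, ZeroOrMore

# Must be called before the elements below are created
ParserElement.setDefaultWhitespaceChars(' \t')
EOL = LineEnd().suppress()
word = Word('a')
entry = word + EOL
# The bare EOL alternative consumes empty and whitespace-only lines
grammar = ZeroOrMore(entry | EOL)

text = "a\n \t \n\naa\n"
tokens = grammar.parseString(text, parseAll=True).asList()
print(tokens)
```

Because the blank lines are matched and suppressed rather than ignored, they leave no tokens in the result.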
Your BNF contains only CR, but your parser terminates lines with LF (LineEnd). Is that intended? ABNF (RFC 5234) defines core rules for all three EOL conventions, LF (Unix), CR (classic Mac) and CRLF (Windows):

    Rule | Def.  | Meaning
    CR   | %x0D  | carriage return
    LF   | %x0A  | linefeed
    CRLF | CR LF | Internet standard newline
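If the input may use any of these three conventions, one option is to normalize every line ending to LF before parsing, since pyparsing's LineEnd matches "\n". A minimal sketch in plain Python; normalize_newlines is a hypothetical helper name, not part of pyparsing.

```python
def normalize_newlines(text):
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so that its CR is not converted a second time
    return text.replace('\r\n', '\n').replace('\r', '\n')


sample = "PARAGRAPH titleone\r\n\tcontent1\rPARAGRAPH titletwo\n\tcontent2\n"
normalized = normalize_newlines(sample)
print(repr(normalized))
```

Note that Python's open() in text mode already performs this translation by default, so the helper is mainly useful for text obtained some other way.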