
Pyparsing : white spaces sometimes matter... sometimes don't

I would like to create a grammar for a file that contains several sections (like PARAGRAPH below).

A section starts with its keyword (e.g. PARAGRAPH), followed by a header (the title here), and has its contents on the following lines; each line of content is one row of the section. In effect, it is like a table with a header, columns and rows.

In the example below (tablefile), I limit each section to one column and one row.

Top-Down BNF of Tablefile:

    tablefile := paragraph*
    paragraph := PARAGRAPH title CR
                 TAB content
    title, content := \w+
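As a cross-check of what the BNF above should accept, here is a minimal plain-Python sketch (not pyparsing) that reads it line by line. The names parse_tablefile, HEADER and CONTENT are hypothetical, and it assumes one or more spaces between PARAGRAPH and the title, which the BNF does not specify.

```python
import re

# Hypothetical helpers, not part of pyparsing: one regex per line kind.
# Assumption: PARAGRAPH and the title are separated by one or more spaces.
HEADER = re.compile(r"PARAGRAPH +(\w+)$")
CONTENT = re.compile(r"\t(\w+)$")


def parse_tablefile(text):
    """Return a list of (title, content) pairs, or raise ValueError."""
    lines = text.split("\n")
    # Tolerate a single trailing newline at the end of the file.
    if lines and lines[-1] == "":
        lines.pop()
    if len(lines) % 2:
        raise ValueError("each paragraph needs a header line and a content line")
    paragraphs = []
    for i in range(0, len(lines), 2):
        header = HEADER.match(lines[i])
        content = CONTENT.match(lines[i + 1])
        if header is None:
            raise ValueError(f"line {i + 1}: expected 'PARAGRAPH title'")
        if content is None:
            raise ValueError(f"line {i + 2}: expected TAB followed by content")
        paragraphs.append((header.group(1), content.group(1)))
    return paragraphs
```

This strict reading rejects blank lines and stray whitespace, so it only serves as a reference for the simplest inputs.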

Pyparsing grammar :

As I need line breaks and tabulation to be handled, I need to set the default whitespace characters to ' '.

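    from pyparsing import LineEnd, OneOrMore, ParserElement, White, Word, alphas

    def grammar():
        '''
        Bottom-up grammar definition
        '''
        # Only spaces are skipped implicitly; tabs and line breaks stay significant
        ParserElement.setDefaultWhitespaceChars(' ')
        TAB = White("\t").suppress()
        CR = LineEnd().setName("Carriage Return").suppress()
        PARAGRAPH = 'PARAGRAPH'

        title = Word(alphas)
        content = Word(alphas)
        paragraph = (PARAGRAPH + title + CR
                     + TAB + content)

        tablefile = OneOrMore(paragraph)
        # Keep tabs in the input instead of expanding them to spaces
        tablefile.parseWithTabs()

        return tablefile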

Applying to examples

This dummy example matches easily :

    PARAGRAPH someTitle
            thisIsContent

This other one, less so :

    PARAGRAPH someTitle
            thisIsContent
    PARAGRAPH otherTitle
            thisIsOtherContent

It expects PARAGRAPH right after the first content, and stumbles upon a line break (remember setDefaultWhitespaceChars(' ')). Am I compelled to add CR? at the end of a paragraph? What would be a better way to ignore such trailing line breaks?

Also, I would like to allow tabs and spaces anywhere in the file without disturbing the parse. The only required behavior is that a paragraph's content starts with TAB, and that PARAGRAPH starts its line. That would also mean skipping blank lines (containing tabs and spaces, or nothing) in and between paragraphs.

Thus I added this line :

tablefile.ignore(LineStart() + ZeroOrMore(White(' \t')) + LineEnd()) 

But every requirement I just listed seems to conflict with my need to set the default whitespace characters to ' ', and puts me in a dead end.

Indeed, this would cause everything to break down :

    tablefile.ignore(CR)
    tablefile.ignore(TAB)

Glue PARAGRAPH and TAB to the start of line

If I want \t to be ignored anywhere in the text except at the start of lines, I have to add it to the default whitespace characters.

I then found a way to forbid any whitespace character at the start of a line: the leaveWhitespace method. It disables the skipping of leading whitespace before the token is matched. Hence, I can glue some tokens to the start of the line.

    ParserElement.setDefaultWhitespaceChars('\t ')
    SOL = LineStart().suppress()
    EOL = LineEnd().suppress()

    title = Word()
    content = Word()
    PARAGRAPH = Keyword('PARAGRAPH').leaveWhitespace()
    TAB = Literal('\t').leaveWhitespace()

    paragraph = (SOL + PARAGRAPH + title + EOL
                 + SOL + TAB + content + EOL)

With this solution, TABs can appear anywhere in the text.

Separating paragraphs

I arrived at PaulMcGuire's solution (delimitedList) after a bit of thinking, and I ran into an issue with it.

Here are two different ways of declaring line-break separators between two paragraphs. In my opinion, they should be equivalent. In practice, they are not.

Crash test (don't forget to replace the spaces with tabs if you run it) :

    PARAGRAPH titleone
            content1
    PARAGRAPH titletwo
            content2

Common part between the two examples :

    ParserElement.setDefaultWhitespaceChars('\t ')
    SOL = LineStart().suppress()
    EOL = LineEnd().suppress()

    title = Word()
    content = Word()
    PARAGRAPH = Keyword('PARAGRAPH').leaveWhitespace()
    TAB = Literal('\t').leaveWhitespace()

First example, working one :

    paragraph = (SOL + PARAGRAPH + title + EOL
                + SOL + TAB + content + EOL)

    tablefile = ZeroOrMore(paragraph)

Second example, not working :

    paragraph = (SOL + PARAGRAPH + title + EOL
                + SOL + TAB + content)

    tablefile = delimitedList(paragraph, delim=EOL)

Shouldn't they be equivalent? The second raises an exception :

Expected end of text (at char 66), (line:4, col:1)

It is not a big issue for me, as I can fall back to putting EOL at the end of every paragraph-like section of my grammar. But I wanted to highlight this point.

Ignoring blank line containing white spaces

Another requirement I had was to ignore blank lines containing whitespace (' \t').

A simple grammar for this would be :

    ParserElement.setDefaultWhitespaceChars(' \t')
    SOL = LineStart().suppress()
    EOL = LineEnd().suppress()

    word = Word('a')
    entry = SOL + word + EOL

    grammar = ZeroOrMore(entry)
    grammar.ignore(SOL + EOL)

In the end, the file can contain one word per line, with any whitespace anywhere, and blank lines should be ignored.

Happily, it does ignore blank lines. But the ignore expression is not affected by the default whitespace declaration: a blank line containing spaces or tabs causes the parser to raise a parsing exception.

This behavior is absolutely not the one I expected. Is it the specified behavior? Is there a bug behind this simple attempt?

I can see in this thread that PaulMcGuire did not try to ignore blank lines but tokenized them instead, in a makefile-like grammar parser (NL = LineEnd().suppress()).

Any python module for customized BNF parser?

    makefile_parser = ZeroOrMore( symbol_assignment
                                 | task_definition
                                 | NL )

The only solution I have for now is to preprocess the file and remove the whitespace contained in blank lines, as pyparsing correctly ignores blank lines with no whitespace in them.

    import tempfile

    # tempfile.TemporaryFile replaces os.tmpfile(), which no longer exists in Python 3
    preprocessed_file = tempfile.TemporaryFile('w+')
    with open(filename, 'r') as file:
        for line in file:
            # Use rstrip to preserve the leading TAB at the start of a paragraph line
            preprocessed_file.write(line.rstrip() + '\n')
    preprocessed_file.seek(0)

    grammar.parseFile(preprocessed_file, parseAll=True)
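The same preprocessing can also be sketched on a string in memory, which avoids the temporary file. This is only a minimal sketch: strip_blank_line_whitespace and the sample text are hypothetical names and data, not part of pyparsing.

```python
def strip_blank_line_whitespace(text):
    """Right-strip every line so that whitespace-only lines become empty."""
    # rstrip() only removes trailing whitespace, so a leading TAB survives
    # on any line that has other content after it.
    return "".join(line.rstrip() + "\n" for line in text.splitlines())


# Hypothetical sample: the third line contains only a space and tabs.
sample = "PARAGRAPH titleone\n\tcontent1\n \t \nPARAGRAPH titletwo \n\tcontent2\n"
cleaned = strip_blank_line_whitespace(sample)
```

The cleaned string could then be handed to grammar.parseString(cleaned, parseAll=True) instead of calling parseFile.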
carrieje asked Apr 09 '14 12:04



1 Answer

Your BNF mentions only CR, but your parser terminates lines on LF. Is that intended? BNF supports LF (Unix), CR (Mac), and CRLF (Windows) EOLs :

    Rule | Def.  | Meaning
    CR   | %x0D  | carriage return
    LF   | %x0A  | linefeed
    CRLF | CR LF | Internet standard newline
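If the input may come from different platforms, one option is to normalize all three endings to LF before parsing, so that a grammar that treats \n as the line terminator sees only that. A minimal sketch, where normalize_newlines is a hypothetical helper name:

```python
def normalize_newlines(text):
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so that its CR is not converted on its own,
    # which would turn one line break into two.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Note that opening a file in text mode in Python 3 already applies this translation by default, so the helper mainly matters for text obtained some other way.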
Cees Timmerman answered Sep 20 '22 06:09
