I want to parse a file with lines of the following content:
simple word abbr -8. (012) word, simple phrase, one another phrase - (simply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!; "It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose") This is the end of the line
so now explaining the parts (not all spaces are described, because of the markup here):
simple word
is one or several words (phrase) separated by the whitespaceabbr -
is a fixed part of the string (never changes)8
- optional number.
- always includedword, simple phrase, one another phrase
- one or several words or phrases separated by comma- (
- fixed part, always includedsimply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!;
- (optional) one or several phrases separated by ;
"It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose"
- (optional) one or several phrases with quotation marks "
separated by ;
) This is the end of the line
- always includedIn the worst case there are no phrases in clause, but this is uncommon: there should be a phrase without augmenting quotation marks (phrase1
type) or with them (phrase2
type).
So the phrases are Natural Language sentences (with all the punctuation possible)...
BUT:
phrase1
or phrase2
types:
(
and ;
or ;
and ;
or ;
and )
or even between (
and )
is augmented with quotation marks, then it is phrase2
typephrase1
typeSince to write a Regex (PCRE) for such an input is an overkill, so I looked to the parsing approach (EBNF or similar). I ended up with a PEG.js parser generator. I created a basic grammar variants (even not handling the part with the different phrases in the clause):
start = term _ "abbr" _ "-" .+
term = word (_? word !(_ "abbr" _ "-"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
or (the difference is only in " abbr -"
and "_ "abbr" _ "-""
):
start = term " abbr -" .+
term = word (_? word !(" abbr -"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
But even this simple grammar cannot parse the beginning of the string. The errors are:
Parse Error Expected [A-Za-z] but " " found.
Parse Error Expected "abbr" but "-" found.
So it looks the problem is in the ambiguity: "abbr"
is consumed withing the term
as a word
token. Although I defined the rule !(" abbr -")
, which I thought has a meaning, that the next word
token will be only consumed, if the next substring is not of " abbr -"
kind.
I didn't found any good examples explaining the following expressions of PEG.js, which seems to me as a possible solution of the aforementioned problem [from: http://pegjs.majda.cz/documentation]:
& expression
! expression
$ expression
& { predicate }
! { predicate }
related to PEG.js:
are there any examples of applying the rules:
& expression
! expression
$ expression
& { predicate }
! { predicate }
general question:
-
char anymore.)I found the rule which solves the problem with matching the "abbr -"
ambiguity:
term = term:(word (!" abbr -" _? word))+ {return term.join("")}
but the result looks strange:
[
"simple, ,word",
" abbr -",
[
"8",
...
],
...
]
if removing the predicate: term = term:(word (!" abbr -" _? word))+
:
[
[
"simple",
[
[
undefined,
[
" "
],
"word"
]
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
I expected something like:
[
[
"simple word"
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
or at least:
[
[
"simple",
[
" ",
"word"
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
The expression is grouped, so why is it separated in so many nesting levels and even undefined
is included in the output? Are there any general rules to fold the result based on the expression in the rule?
I created the grammar so that it parses as desired, though I didn't yet identified the clear process of such a grammar creation:
start
= (term:term1 (" abbr -" number "." _ "("number:number") "{return number}) terms:terms2 ((" - (" phrases:phrases ")" .+){return phrases}))
//start //alternative way = looks better
// = (term:term1 " abbr -" number "." _ "("number:number") " terms:terms2 " - (" phrases:phrases ")" .+){return {term: term, number: number, phrases:phrases}}
term1
= term1:(
start_word:word
(rest_words:(
rest_word:(
(non_abbr:!" abbr -"{return non_abbr;})
(space:_?{return space[0];}) word){return rest_word.join("");})+{return rest_words.join("")}
)) {return term1.join("");}
terms2
= terms2:(start_word:word (rest_words:(!" - (" ","?" "? word)+){rest_words = rest_words.map(function(array) {
return array.filter(function(n){return n != null;}).join("");
}); return start_word + rest_words.join("")})
phrases
// = ((phrase_t:(phrase / '"' phrase '"') ";"?" "?){return phrase_t})+
= (( (phrase:(phrase2 / phrase1) ";"?" "?) {return phrase;})+)
phrase2
= (('"'pharse2:(phrase)'"'){return {phrase2: pharse2}})
phrase1
= ((pharse1:phrase){return {phrase1: pharse1}})
phrase
= (general_phrase:(!(';' / ')' / '";' / '")') .)+ ){return general_phrase.map(function(array){return array[1]}).join("")}
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
number = digits:digit+{return digits.join("")}
digit = [0-9]
_ "whitespace"
= [ \t\n\r]*
It could be tested either on the PEG.js author's site: [http://pegjs.majda.cz/online] or on the PEG.js Web-IDE: [http://peg.arcanis.fr/]
If somebody has answers for the previous questions (i.e. general approach for disambiguation of the grammar, examples to the expressions available in PEG.js) as well as improvement advices to the grammar itself (this is I think far away from an ideal grammar now), I would very appreciate!
so why is it separated in so many nesting levels and even undefined is included in the output?
If you look at the documentation for PEG.js, you'll see almost every operator collects the results of its operands into an array. undefined
is returned by the !
operator.
The $
operator bypasses all this nesting and just gives you the actual string that matches, eg: [a-z]+
will give an array of letters, but $[a-z]+
will give a string of letters.
I think most of the parsing here follows the pattern: "give me everything until I see this string". You should express this in PEG by first using !
to make sure you haven't hit the terminating string, and then just taking the next character. For example, to get everything up to " abbr -":
(!" abbr -" .)+
If the terminating string is a single character, you can use [^]
as a short form of this, eg: [^x]+
is a shorter way of saying (!"x" .)+
.
Parsing comma/semicolon separated phrases rather than comma/semicolon terminated phrases is slightly annoying, but treating them as optional terminators seems to work (with some trim
ing).
start = $(!" abbr -" .)+ " abbr -" $num "." [ ]? "(012)"
phrase_comma+ "- (" noq_phrase_semi+ q_phrase_semi+ ")"
$.*
phrase_comma = p:$[^-,]+ [, ]* { return p.trim() }
noq_phrase_semi = !'"' p:$[^;]+ [; ]* { return p.trim() }
q_phrase_semi = '"' p:$[^"]+ '"' [; ]* { return p }
num = [0-9]+
gives
[
"simple word",
" abbr -",
"8",
".",
" ",
"(012)",
[
"word",
"simple phrase",
"one another phrase"
],
"- (",
[
"simply dummy text of the printing",
"Lorem Ipsum : \"Lorem\" - has been the industry's standard dummy text, ever since the 1500s!"
],
[
"It is a long established!",
"Sometimes by accident, sometimes on purpose (injected humour and the like)",
"sometimes on purpose"
],
")",
" This is the end of the line"
]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With