I would like to extract information from a body of text and be able to query it.
The structure of this body of text would be specified by a BNF grammar (or variant), and the information to extract would be specified at runtime (the syntax of the query does not matter at the moment).
So the requirements are simple, really:
To illustrate with an example, suppose that we have such grammar (in a customized BNF format):
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<id> ::= 15 * digit
<hex> ::= 10 * (<digit> | a | b | c | d | e | f)
<anything> ::= <digit> | .... (all characters)
<match> ::= <id> (" " <hex>)*
<nomatch> ::= "." <anything>*
<line> ::= (<match> | <nomatch> | "") [<CR>] <LF>
<text> ::= <line>+
For which such text would be conforming:
012345678901234
012345678901234 abcdef0123
Nor the previous line nor this one would match
And then I would want to list all tags that appear in the rule, so for example using an XPath like syntax:
match//id
which would return a list.
This sounds relatively easy, except that I have two big constraints:
Some precisions:
I know of lex/yacc and flex/bison for example, however they appear to only create C / C++ code to be compiled, which is not what I am looking after.
Do you know of a robust library (preferably free and open-source) that can transform a BNF grammar into a parser "on-the-fly" and produce a structured in-memory output from a body of text using this parser ?
EDIT: I am open to alternatives. At the moment, the idea was that perhaps regexes could allow this extraction, however given the complexity of the grammars involved, this could get ugly quickly and thus maintaining the regexes would be quite a horrendous task. Furthermore, by separating grammars and extraction I hope to be able to reuse the same grammar for different extractions needs rather than having slightly different regexes each time.
I have a proprietary solution that can convert grammar source into an in memory representation. The result is a pure data structure. Any code can use it. I also have C++ class that actually implements the parser. Rule handlers are implemented as virtual methods.
The primary difference between our solution and YACC/Bison is that no C/C++ code is generated. This means that grammar can be reloaded without recompiling the app. The grammar can be annotated with application IDs that are used in the code of the rule handlers.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With