Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting information using BNF grammars

I would like to extract information from a body of text and be able to query it.

The structure of this body of text would be specified by a BNF grammar (or variant), and the information to extract would be specified at runtime (the syntax of the query does not matter at the moment).

So the requirements are simple, really:

  • Receive some structured body of text
  • Load it in an exploitable form using a grammar to parse it
  • Run a query to select some portions of it

To illustrate with an example, suppose that we have such grammar (in a customized BNF format):

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<id> ::= 15 * digit

<hex> ::= 10 * (<digit> | a | b | c | d | e | f)

<anything> ::= <digit> | .... (all characters)

<match> ::= <id> (" " <hex>)*

<nomatch> ::= "." <anything>*

<line> ::= (<match> | <nomatch> | "") [<CR>] <LF>

<text> ::= <line>+

For which such text would be conforming:

012345678901234
012345678901234 abcdef0123

Nor the previous line nor this one would match

And then I would want to list all tags that appear in the rule, so for example using an XPath like syntax:

match//id

which would return a list.


This sounds relatively easy, except that I have two big constraints:

  • the BNF grammar should be read at runtime (from a string/vector like structure)
  • the queries will be read at runtime too

Some precisions:

  • the grammar is not expected to change often so a "compilation" step to produce an in-memory structure is acceptable (and perhaps necessary to achieve good speed)
  • speed is of the essence, bonus points for on-the-fly collection of the wanted portions
  • bonus points for the possibility to have callbacks to disambiguate (sometimes the necessary disambiguation information might require DB access for example)
  • bonus points for multipart grammars (favoring modularity and reuse of grammar elements)

I know of lex/yacc and flex/bison for example, however they appear to only create C / C++ code to be compiled, which is not what I am looking after.

Do you know of a robust library (preferably free and open-source) that can transform a BNF grammar into a parser "on-the-fly" and produce a structured in-memory output from a body of text using this parser ?

EDIT: I am open to alternatives. At the moment, the idea was that perhaps regexes could allow this extraction, however given the complexity of the grammars involved, this could get ugly quickly and thus maintaining the regexes would be quite a horrendous task. Furthermore, by separating grammars and extraction I hope to be able to reuse the same grammar for different extractions needs rather than having slightly different regexes each time.

like image 505
Matthieu M. Avatar asked Jun 12 '12 14:06

Matthieu M.


1 Answers

I have a proprietary solution that can convert grammar source into an in memory representation. The result is a pure data structure. Any code can use it. I also have C++ class that actually implements the parser. Rule handlers are implemented as virtual methods.

The primary difference between our solution and YACC/Bison is that no C/C++ code is generated. This means that grammar can be reloaded without recompiling the app. The grammar can be annotated with application IDs that are used in the code of the rule handlers.

like image 194
Kirill Kobelev Avatar answered Oct 05 '22 03:10

Kirill Kobelev