How to generate random programs from BNF

Question

I know my question sounds a little vague, but I could not find any tutorials online. I am not asking for an answer, but for more of an explanation. An example of the BNF:

<prog> ::= “int main() { <stat_list> return 0; }”
<stat_list>  ::= <stat>
         | <stat_list> <stat>
<stat>       ::= <cmpd_stat>
         | <if_stat>
         | <iter_stat>
         | <assgn_stat>
         | <decl_stat>
<cmpd_stat>  ::= { <stat_list> }
<if_stat>    ::= if ( <exp> ) <stat>
         | if ( <exp> ) <cmpd_stat>
         | if ( <exp> ) <stat> else <stat>
         | if ( <exp> ) <cmpd_stat> else <stat>
         | if ( <exp> ) <stat> else <cmpd_stat>
         | if ( <exp> ) <cmpd_stat> else <cmpd_stat>

What would be the easiest way to convert this into python to have my program create a random program using the conditions above? Any help of links to useful websites would be greatly appreciated.

Daniel Rodríguez · Accepted Answer

NLTK has a package for grammars. Normally used for sentences analysis, but nothing stops you to using it for create a "program" following that rules.

I think NLTK only let you define a Context Free Grammar, so I'm leaving you here a little example I did:

from nltk import CFG
from nltk.parse.generate import generate

#Define your grammar from string
#You can define it using other methods, but I only know this xD

grammar = CFG.fromstring(""" S -> NP VP
  VP -> V NP
  V -> "mata" | "roba"
  NP -> Det N | NP NP
  Det -> "un" | "el" | "con" | "a" | "una"
  N -> "bebé" | "ladrón" | "Obama" | "perrete" | "navastola" | "navaja" | "pistola" """)

''' This grammar creates sentences like:
        El bebé roba a Obama
        Baby steals Obama (in spanish)
'''
#With this we "create" all the possible combinations
grammar.productions()

#Here you can see all the productions (sentences) with 5 words
#created with this grammar
for production in generate(grammar, depth=5):
    print(' '.join(production))

Ira Baxter · Answer

You can do this by abusing a parser to turn it into a generator.

First, build a recursive parser for your language. (See my SO answer on how to do just that). pause while you read that .... I now assume you understand how to do that.

You'll note that such a parser is full of calls from the parser function for one grammar rule, to other functions for other grammar rules or primitive token matchers.

What you want to do is modify each call to decide that it will return "true" with some low probability if there is some alternative still available in the function, before the call is made. If a call decides on false, control simply passes to another part of the parser. if a call decides true, it actually makes the call; the callee must now act in a way that will return true and generate corresponding source code. At some point, this will force a call to a token reader to return true; the token reader gets replaced by a print function that emits a random token. What actually happens when you do this is that calls to decide if something is true now simply become calls; we no longer need a return status because the called function must return true. This changes our functions-returning-bools into procedures-returning-void. See the example below..

Let's try an example with this simple grammar for a simple programming language p:

p = s ;
s = v '=' e ;
s = 'if' e 'then' s ;
e = v ;
e = v '+' n ;

OK, our recursive descent parser for p (I'm not a Python guy, so this is psuedocode):

function p() { return s(); } // no alternatives
function s() { if v()
               then if match("=")
                    then return e()
                    else return false;
               else if match("if")
                    then if e()
                         then if match("then")
                              then return s()
                              else return false;
                         else return false;
                    else return false;
              }
 function e() { if v()
                then if match ("+")
                     then if n()
                     else return true
                else return false
              }
 function v() { return match_variable_name(); }
 function n() { return match_integer_constant(); }

OK, now lets force the calls to decide if they are going to succeed using a coin flip function that randomly returns true or false. Any construct of the form:

          if <testsomething> then <action x> else <action y>

gets turned into:

          if flip() then  { <testsomething> <action x> } else <action y>

and any construct of the form:

          if  <testsomething> then <action x> else return false

gets turned into

          { <testsomething>; <action x> }

because it must succeed if we are to generate a parsable programs.

If testsomething is a function-call to another grammer rule, we leave it alone. Function calls to primitive token matches get turned into print statements: if testsomething is "match(Q)", then replace it by "print(Q)"; this is what actually generates a piece of the program.

procedure p() { s(); } // no choice, this has to succeed
procedure s() { if flip() // flip == true --> v must succeed
               then { v();
                      print("=") // because if no match, procedure fails
                      e();
                    }
               else { print("if")  // if we get here, must succeed
                      e();
                      print("then"); // because match("then") must succeed
                      s();
                    }
              }
 procedure e() { v(); // because there are no alternatives
                 if flip() then { print("+");
                                  n();
                                }
                 else { }
               }
 procedure v() { print_variable_name(); }
 procedure n() { print_integer_constant(); }

Note that the token recognizers for variable name and integer constants, now become print procedures that print random variable names/constants. This is essentially just pushing "flip" into those procedures, too.

Now this may print arbitrarily long programs because flip may force s to call itself repeatedly. If flip is 50-50, your chances of 10 recursions in 1 in a 1000 so probably ok. However, you might decide to bias each individual flip to choose the shorter phrase, based on the size of the output generated so far, or the depth of any recursion.

Now, what this won't do in the general case produce semantically correct programs. That's because our parser is "context free"; there are no constraints on one part of the generated code forced by other parts. As an example, if your language had to declare a variable before using it, this scheme doesn't guarantee that a declaration for random-var-X will be produced before randome-var-X appears in an expression.

There's no easy way to fix this, because language semantics aren't gauranteed to be "easy". Just goes to show that parsing a program ("technically easy") and checking for correct semantics ("arbitrarily hard", consider C++), leads to any equally hard problem of generating random program that doesn't violate langauge semantics.

How to generate random programs from BNF

Tags:

python

parsing

code-generation

bnf

John Doee

2 Answers

Daniel Rodríguez

Ira Baxter

Recent Activity

Donate For Us

How to generate random programs from BNF

Tags:

python

parsing

code-generation

bnf

John Doee

2 Answers

Daniel Rodríguez

Ira Baxter

Related questions

Recent Activity

Donate For Us