Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating a Brainfuck parser, whats the best method of parsing loop operators?

I'm creating a Brainfuck parser (in a BASIC dialect) ultimately to create an interpreter but i've realise it's not as straight forward as i first thought. My problem is that i need a way to accurately parse the matching loop operators within a Brainfuck program. This is an example program:

,>,>++++++++[<------<------>>-]
<<[>[>+>+<<-]>>[<<+>>-]<<<-]
>>>++++++[<++++++++>-],<.>.

'[' = start of loop

']' = end of loop

I need to record the start and end point of each matching loop operator so i can jump around the source as needed. Some loops are alone, some are nested.

What would be the best way to parse this? I was thinking maybe move through the source file creating a 2D array (or such like) recording the start and end positions of each matching operator, but this seems like a lot of 'to'ing and fro'ing' through the source. Is this the best way to do it?

More info: Brainfuck homepage

EDIT: Sample code in any language greatly appreciated.

like image 372
Gary Willoughby Avatar asked Jun 28 '09 20:06

Gary Willoughby


3 Answers

Interesting enough, just a couple days ago, I was writing a brainf*ck interpreter in Java.

One of the issues I was having was that the explanation of the commands at the official page was insufficient, and did not mention the part about nested loops. The Wikipedia page on Brainf*ck has a Commands subsection which describes the correct behavior.

Basically to summarize the problem, the official page says when an instruction is a [ and the current memory location is 0, then jump to the next ]. The correct behavior is to jump to the corresponding ], not the next one.

One way to achieve this behavior is to keep track of the level of nesting. I ended up implementing this by having a counter which kept track of the nesting level.

The following is part of the interpreter's main loop:

do {
  if (inst[pc] == '>') { ... }
  else if (inst[pc] == '<') { ... }
  else if (inst[pc] == '+') { ... }
  else if (inst[pc] == '-') { ... }
  else if (inst[pc] == '.') { ... }
  else if (inst[pc] == ',') { ... }
  else if (inst[pc] == '[') {
    if (memory[p] == 0) {
      int nesting = 0;

      while (true) {
        ++pc;

        if (inst[pc] == '[') {
          ++nesting;
          continue;
        } else if (nesting > 0 && inst[pc] == ']') {
          --nesting;
          continue;
        } else if (inst[pc] == ']' && nesting == 0) {
          break;
        }
      }
    }
  }
  else if (inst[pc] == ']') {
    if (memory[p] != 0) {
      int nesting = 0;

      while (true) {
        --pc;

        if (inst[pc] == ']') {
          ++nesting;
          continue;
        } else if (nesting > 0 && inst[pc] == '[') {
          --nesting;
          continue;
        } else if (inst[pc] == '[' && nesting == 0) {
          break;
        }
      }
    }
  }
} while (++pc < inst.length);

Here is the legend for the variable names:

  • memory -- the memory cells for the data.
  • p -- pointer to the current memory cell location.
  • inst -- an array holding the instructions.
  • pc -- program counter; points to the current instruction.
  • nesting -- level of the nesting of the current loop. nesting of 0 means that the current location is not in a nested loop.

Basically, when a loop opening [ is encountered, the current memory location is checked to see if the value is 0. If that is the case, a while loop is entered to jump to the corresponding ].

The way the nesting is handled is as follows:

  1. If an [ is encountered while seeking for the corresponding loop closing ], then the nesting variable is incremented by 1 in order to indicate that we have entered a nested loop.

  2. If an ] is encountered, and:

    a. If the nesting variable is greater than 0, then the nesting variable is decremented by 1 to indicate that we've left a nested loop.

    b. If the nesting variable is 0, then we know that the end of the loop has been encountered, so seeking the end of the loop in the while loop is terminated by executing a break statement.

Now, the next part is to handle the closing of the loop by ]. Similar to the opening of the loop, it will use the nesting counter in order to determine the current nesting level of the loop, and try to find the corresponding loop opening [.

This method may not be the most elegant way to do things, but it seems like it is resource-friendly because it only requires one extra variable to use as a counter for the current nesting level.

(Of course, "resource-friendly" is ignoring the fact that this interpreter was written in Java -- I just wanted to write some quick code and Java just happened to be what I wrote it in.)

like image 31
coobird Avatar answered Sep 28 '22 09:09

coobird


Have you considered using a Stack data structure to record "jump points" (i.e. the location of the instruction pointer).

So basically, every time you encounter a "[" you push the current location of the instruction pointer on this stack. Whenever you encounter a "]" you reset the instruction pointer to the value that's currently on the top of the stack. When a loop is complete, you pop it off the stack.

Here is an example in C++ with 100 memory cells. The code handles nested loops recursively and although it is not refined it should illustrate the concepts..

char cells[100] = {0};   // define 100 memory cells
char* cell = cells;      // set memory pointer to first cell
char* ip = 0;            // define variable used as "instruction pointer"

void interpret(static char* program, int* stack, int sp)
{
    int tmp;
    if(ip == 0)              // if the instruction pointer hasn't been initialized
        ip = program;        //  now would be a good time

    while(*ip)               // this runs for as long as there is valid brainF**k 'code'
    {
        if(*ip == ',')
            *cell = getch();
        else if(*ip == '.')
            putch(*cell);
        else if(*ip == '>')
            cell++;
        else if(*ip == '<')
            cell--;
        else if(*ip == '+')
            *cell = *cell + 1;
        else if(*ip == '-')
            *cell = *cell - 1;
        else if(*ip == '[')
        {           
            stack[sp+1] = ip - program;
            *ip++;
            while(*cell != 0)
            {
                interpret(program, stack, sp + 1);
            }
            tmp = sp + 1;
            while((tmp >= (sp + 1)) || *ip != ']')
            {
                *ip++;
                if(*ip == '[')
                    stack[++tmp] = ip - program;
                else if(*ip == ']')
                    tmp--;
            }           
        }
        else if(*ip == ']')
        {
            ip = program + stack[sp] + 1;
            break;
        }
        *ip++;       // advance instruction
    }
}

int _tmain(int argc, _TCHAR* argv[])
{   
    int stack[100] = {0};  // use a stack of 100 levels, modeled using a simple array
    interpret(",>,>++++++++[<------<------>>-]<<[>[>+>+<<-]>>[<<+>>-]<<<-]>>>++++++[<++++++++>-],<.>.", stack, 0);
    return 0;
}

EDIT I just went over the code again and I realized there was a bug in the while loop that would 'skip' parsed loops if the value of the pointer is 0. This is where I made the change:

while((tmp >= (sp + 1)) || *ip != ']')     // the bug was tmp > (sp + 1)
{
ip++;
if(*ip == '[')
    stack[++tmp] = ip - program;
else if(*ip == ']')
    tmp--;
}

Below is an implementation of the same parser but without using recursion:

char cells[100] = {0};
void interpret(static char* program)
{
    int cnt;               // cnt is a counter that is going to be used
                           //     only when parsing 0-loops
    int stack[100] = {0};  // create a stack, 100 levels deep - modeled
                           //     using a simple array - and initialized to 0
    int sp = 0;            // sp is going to be used as a 'stack pointer'
    char* ip = program;    // ip is going to be used as instruction pointer
                           //    and it is initialized at the beginning or program
    char* cell = cells;    // cell is the pointer to the 'current' memory cell
                           //      and as such, it is initialized to the first
                           //      memory cell

    while(*ip)             // as long as ip point to 'valid code' keep going
    {
        if(*ip == ',')
            *cell = getch();
        else if(*ip == '.')
            putch(*cell);
        else if(*ip == '>')
            cell++;
        else if(*ip == '<')
            cell--;
        else if(*ip == '+')
            *cell = *cell + 1;
        else if(*ip == '-')
            *cell = *cell - 1;
        else if(*ip == '[')
        {           
            if(stack[sp] != ip - program)
                stack[++sp] = ip - program;

            *ip++;

            if(*cell != 0)
                continue;
            else
            {                   
                cnt = 1;
                while((cnt > 0) || *ip != ']')
                {
                    *ip++;
                    if(*ip == '[')
                    cnt++;
                    else if(*ip == ']')
                    cnt--;
                }
                sp--;
            }
        }else if(*ip == ']')
        {               
            ip = program + stack[sp];
            continue;
        }
        *ip++;
    }
}

int _tmain(int argc, _TCHAR* argv[])
{   
    // define our program code here..
    char *prg = ",>++++++[<-------->-],[<+>-]<.";

    interpret(prg);
    return 0;
}
like image 150
Mike Dinescu Avatar answered Sep 28 '22 08:09

Mike Dinescu


The canonical method for parsing a context-free grammar is to use a stack. Anything else and you're working too hard and risking correctness.

You may want to use a parser generator like cup or yacc, as a lot of the dirty work is done for you, but with a language as simple as BF, it may be overkill.

like image 38
Chris Avatar answered Sep 28 '22 09:09

Chris