Building a parser (Part I)

Tags:

parsing

programming-languages

translate

I'm making my own JavaScript-based programming language (yeah, it's crazy, but it's only for learning... maybe?). I'm reading about parsers, and the first pass is to convert the source code to tokens, like:

if(x > 5)
  return true;

The tokenizer turns this into:

T_IF          "if"
T_LPAREN      "("
T_IDENTIFIER  "x"
T_GT          ">"
T_NUMBER      "5"
T_RPAREN      ")"
T_IDENTIFIER  "return"
T_TRUE        "true"
T_TERMINATOR  ";"

I don't know if my logic is correct so far. My parser goes a step further (better or not?) and translates it to this (yeah, a multidimensional array):

T_IF             "if"
  T_EXPRESSION     ...
    T_IDENTIFIER     "x"
    T_GT             ">"
    T_NUMBER         "5"
  T_CLOSURE        ...
    T_IDENTIFIER     "return"
    T_TRUE           "true"

I have some questions:

1. Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of being interpreted every time.
2. After I tokenize, what exactly do I need to do? I'm really lost on this pass!
3. Are there any good tutorials to learn how to do it?

Well, that's it. Bye!

asked Feb 26 '12 11:02

David Rodrigues

4 Answers

Generally, you want to separate the functions of the tokeniser (also called a lexer) from other stages of your compiler or interpreter. The reason for this is basic modularity: each pass consumes one kind of thing (e.g., characters) and produces another one (e.g., tokens).

So you’ve converted your characters to tokens. Now you want to convert your flat list of tokens to meaningful nested expressions, and this is what is conventionally called parsing. For a JavaScript-like language, you should look into recursive descent parsing. For parsing expressions with infix operators of different precedence levels, Pratt parsing is very useful, and you can fall back on ordinary recursive descent parsing for special cases.
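To make the Pratt idea a little more tangible, here is a minimal sketch in Python. It is not taken from any particular library: the `BINDING_POWER` table, the function name `parse_expression`, and the choice to represent tokens as plain strings and trees as tuples are all assumptions made for illustration.

```python
# Binding powers are an assumption for this sketch; a higher number binds tighter.
BINDING_POWER = {"<": 10, ">": 10, "+": 20, "-": 20, "*": 30, "/": 30}

def parse_expression(tokens, pos=0, min_bp=0):
    """Parse tokens starting at pos; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    pos += 1
    if token == "(":
        # A parenthesised group restarts parsing at the lowest binding power.
        left, pos = parse_expression(tokens, pos, 0)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        pos += 1
    elif token in BINDING_POWER or token == ")":
        raise SyntaxError(f"unexpected token {token!r}")
    else:
        left = token  # a number or an identifier

    # Keep absorbing operators that bind tighter than the caller's operator.
    while pos < len(tokens):
        op = tokens[pos]
        bp = BINDING_POWER.get(op)
        if bp is None or bp <= min_bp:
            break
        right, pos = parse_expression(tokens, pos + 1, bp)
        left = (op, left, right)
    return left, pos

tree, _ = parse_expression(["x", ">", "5", "+", "2", "*", "3"])
print(tree)  # ('>', 'x', ('+', '5', ('*', '2', '3')))
```

Because the loop stops when the next operator's binding power is not strictly greater than `min_bp`, operators of equal precedence group to the left, so `1 - 2 - 3` parses as `(1 - 2) - 3`.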

Just to give you a more concrete example based on your case, I’ll assume you can write two functions: accept(token) and expect(token), which test the next token in the stream you’ve created. You’ll make a function for each type of statement or expression in the grammar of your language. Here’s Pythonish pseudocode for a statement() function, for instance:

def statement():

  if accept("if"):
    x = expression()
    y = statement()
    return IfStatement(x, y)

  elif accept("return"):
    x = expression()
    return ReturnStatement(x)

  elif accept("{"):
    xs = []
    while True:
      xs.append(statement())
      if not accept(";"):
        break
    expect("}")
    return Block(xs)

  else:
    error("Invalid statement!")

This gives you what’s called an abstract syntax tree (AST) of your program, which you can then manipulate (optimisation and analysis), output (compilation), or run (interpretation).
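If it helps to see the pseudocode run, here is one possible way to flesh it out in Python. Everything beyond `accept`, `expect`, `statement`, `expression`, and the three node names is an assumption of this sketch: the `Parser` class, the `peek` and `primary` helpers, tokens as plain strings, and an `expression()` that only understands a single `>` comparison.

```python
from dataclasses import dataclass

@dataclass
class IfStatement:
    condition: object
    body: object

@dataclass
class ReturnStatement:
    value: object

@dataclass
class Block:
    statements: list

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def accept(self, token):
        """Consume the next token if it equals `token`; report whether it did."""
        if self.peek() == token:
            self.pos += 1
            return True
        return False

    def expect(self, token):
        """Like accept(), but a mismatch is a syntax error."""
        if not self.accept(token):
            raise SyntaxError(f"expected {token!r}, got {self.peek()!r}")

    def primary(self):
        if self.accept("("):
            inner = self.expression()
            self.expect(")")
            return inner
        token = self.peek()
        if token is None or token in {")", ">", ";", "{", "}"}:
            raise SyntaxError(f"unexpected token {token!r}")
        self.pos += 1
        return token  # an identifier or literal, kept as its raw text

    def expression(self):
        # Deliberately tiny: a primary, optionally followed by "> primary".
        left = self.primary()
        if self.accept(">"):
            return (">", left, self.primary())
        return left

    def statement(self):
        if self.accept("if"):
            condition = self.expression()
            return IfStatement(condition, self.statement())
        elif self.accept("return"):
            return ReturnStatement(self.expression())
        elif self.accept("{"):
            statements = []
            while True:
                statements.append(self.statement())
                if not self.accept(";"):
                    break
            self.expect("}")
            return Block(statements)
        else:
            raise SyntaxError("Invalid statement!")

tokens = ["if", "(", "x", ">", "5", ")", "return", "true"]
print(Parser(tokens).statement())
```

Note that, as in the pseudocode, `;` acts as a separator between statements inside a block, so this sketch rejects a trailing `;` right before `}`.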

answered Oct 06 '22 08:10

Jon Purdy

Most toolkits split the complete process into two separate parts:

lexer (a.k.a. tokenizer)
parser (a.k.a. grammar)

The tokenizer will split the input data into tokens. The parser will only operate on the token "stream" and build the structure.

Your question seems to be focused on the tokenizer, but your second solution mixes the grammar parser and the tokenizer into one step. In theory this is possible, but for a beginner it is much easier to do it the same way as most other tools and frameworks: keep the steps separate.

To your first solution: I would tokenize your example like this:

T_KEYWORD_IF   "if"
T_LPAREN       "("
T_IDENTIFIER   "x"
T_GT           ">"
T_LITERAL      "5"
T_RPAREN       ")"
T_KEYWORD_RET  "return"
T_KEYWORD_TRUE "true"
T_TERMINATOR   ";"

In most languages keywords cannot be used as method names, variable names and so on. This is reflected already on the tokenizer level (T_KEYWORD_IF, T_KEYWORD_RET, T_KEYWORD_TRUE).
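A tokenizer along these lines could be sketched in Python with the standard `re` module. The token names follow the `T_*` style used above, but the `KEYWORDS` table, the `TOKEN_SPEC` list, and the `tokenize` function are assumptions of this sketch rather than part of any framework. Keywords are first matched as identifiers and then promoted through a lookup table, which is one common way to reserve them at the tokenizer level.

```python
import re

# Reserved words get their own token kinds (names are illustrative).
KEYWORDS = {"if": "T_KEYWORD_IF", "return": "T_KEYWORD_RET", "true": "T_KEYWORD_TRUE"}

# Order matters: earlier patterns win when several could match.
TOKEN_SPEC = [
    ("T_LITERAL",    r"\d+"),
    ("T_IDENTIFIER", r"[A-Za-z_]\w*"),
    ("T_GT",         r">"),
    ("T_LPAREN",     r"\("),
    ("T_RPAREN",     r"\)"),
    ("T_TERMINATOR", r";"),
    ("SKIP",         r"\s+"),
    ("MISMATCH",     r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a flat list of (kind, text) pairs for the given source string."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue  # whitespace carries no meaning in this sketch
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {text!r} at offset {match.start()}")
        if kind == "T_IDENTIFIER":
            kind = KEYWORDS.get(text, kind)  # promote reserved words
        tokens.append((kind, text))
    return tokens

for kind, text in tokenize("if(x > 5)\n  return true;"):
    print(f"{kind:<15} {text!r}")
```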

The next level would take this stream and, by applying a formal grammar, build some data structure (often called an AST, for abstract syntax tree), which might look like this:

IfStatement:
    Expression:
        BinaryOperator:
            Operator: T_GT
            LeftOperand:
                IdentifierExpression:
                    "x"
            RightOperand:
                LiteralExpression:
                    5
    IfBlock:
        ReturnStatement:
            ReturnExpression:
                LiteralExpression:
                    "true"
    ElseBlock: (empty)
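In code, a tree like this is often just a handful of small record types. The following Python sketch mirrors the node names above using dataclasses; the field names (`condition`, `if_block`, `else_block`, and so on) are assumptions chosen for readability, not something a framework prescribes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IdentifierExpression:
    name: str

@dataclass
class LiteralExpression:
    value: object

@dataclass
class BinaryOperator:
    operator: str
    left: object
    right: object

@dataclass
class ReturnStatement:
    expression: object

@dataclass
class IfStatement:
    condition: object
    if_block: object
    else_block: Optional[object] = None  # None stands for the empty ElseBlock

# The tree for: if(x > 5) return true;
tree = IfStatement(
    condition=BinaryOperator("T_GT", IdentifierExpression("x"), LiteralExpression(5)),
    if_block=ReturnStatement(LiteralExpression("true")),
)
print(tree)
```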

The parser itself is usually generated by a framework. Writing something like that by hand, and efficiently, is usually done at a university over the better part of a semester, so you really should use some kind of framework.

The input for a grammar parser framework is usually a formal grammar in some kind of BNF. Your "if" part might look like this:

IfStatement: T_KEYWORD_IF T_LPAREN Expression T_RPAREN Statement ;

Expression: LiteralExpression | BinaryExpression | IdentifierExpression | ... ;

BinaryExpression: LeftOperand BinaryOperator RightOperand;

....

That's only to give you the idea. Parsing a real-world language like JavaScript correctly is not an easy task. But it is fun.

answered Oct 06 '22 07:10

A.H.

Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of being interpreted every time.

What's the original way? There are many different ways to implement languages. I think yours is fine, actually. I once tried to build a language myself, the hack programming language, that translated to C#. Many language compilers translate to an intermediate language; it's quite common.

After I tokenize, what exactly do I need to do? I'm really lost on this pass!

After tokenizing, you need to parse. Use a good lexer/parser framework, such as Boost.Spirit or Coco, or whatever; there are hundreds of them. Or you can implement your own lexer, but that takes time and resources. There are many ways to parse code; I generally rely on recursive descent parsing.

Next you need to do code generation. That's the most difficult part, in my opinion. There are tools for that too, but you can do it manually if you want to. I tried to do it in my project, but it was pretty basic and buggy. There's some helpful code here and here.
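For a rough feel of what manual code generation means when the target is PHP (as in the question), here is a very small Python sketch. It is not the code from my project: the tuple-based node layout (`"if"`, `"return"`, `"binary"`, `"id"`, `"num"`, `"bool"`) and the `emit` and `emit_operand` functions are invented for this example, and it covers only the constructs in the question's snippet.

```python
def emit_operand(node):
    """Emit a sub-expression, parenthesising nested binary operations."""
    text = emit(node)
    return f"({text})" if node[0] == "binary" else text

def emit(node):
    """Turn a tuple-based AST node into PHP source text."""
    kind = node[0]
    if kind == "if":
        _, condition, body = node
        return f"if ({emit(condition)}) {{ {emit(body)} }}"
    if kind == "return":
        return f"return {emit(node[1])};"
    if kind == "binary":
        _, op, left, right = node
        return f"{emit_operand(left)} {op} {emit_operand(right)}"
    if kind == "id":
        return "$" + node[1]  # PHP variables carry a leading "$"
    if kind in ("num", "bool"):
        return node[1]
    raise ValueError(f"unknown node kind {kind!r}")

ast = ("if", ("binary", ">", ("id", "x"), ("num", "5")), ("return", ("bool", "true")))
print(emit(ast))  # if ($x > 5) { return true; }
```

The generator is a plain recursive walk: each node kind knows how to print itself and delegates to `emit` for its children.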

Are there any good tutorials to learn how to do it?

As I suggested earlier, use tools to do it. There are a lot of good, well-documented parser frameworks. For further information, you can try asking people who know about this stuff. @DeadMG, over at the Lounge C++, is building a programming language called "Wide". You may try consulting him.

answered Oct 06 '22 09:10

ApprenticeHacker

Let's say I have this statement in a programming language:

if (0 < 1) then
   print("Hello")

The lexer will translate it into:

keyword: if
num: 0
op: <
num: 1
keyword: then
keyword: print
string: "Hello"

The parser will then take the information (aka "Token Stream") and make this:

if:
  expression:
    <:
      0, 1
then:
  print:
    "Hello"

I don't know if this will help or not, but I hope it does.

answered Oct 06 '22 07:10

InfiniteDonuts
