Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to write a (shell) lexer by hand

I'm working on a shell, a small bash-like shell, without scripting (if while ...) I have to make the lexer/parser (LL) by hand.

So the lexer will transform the command (char *cmd) to a linked list (t_list *list). And the LL parser will transform the linked list (t_list *list) to an AST (binary tree t_btree *root) with a grammar

So, I know how to make the LL parser but I don't know how to tokenize my command.

For example: ps | grep ls >> file ; make && ./a.out

=> 'ps' '|' 'grep' 'ls' '>>' 'file' ';' ''make '&&' './a.out'

Thanks.

(I don't wanna use any generator)

like image 732
mathieug Avatar asked Mar 30 '11 20:03

mathieug


People also ask

What is a lexer in code?

The program within the compiler which is responsible for doing the work of scanning and evaluating is often referred to as the lexer or the tokenizer, and the entire lexical analysis phase is sometimes called the process of lexing or tokenizing.

What is lexer file?

Overview. The lexer is contained in the file lex.cc . It is a hand-coded lexer, and not implemented as a state machine. It can understand C, C++ and Objective-C source code, and has been extended to allow reasonably successful preprocessing of assembly language.

How does a lexer work?

The lexer just turns the meaningless string into a flat list of things like "number literal", "string literal", "identifier", or "operator", and can do things like recognizing reserved identifiers ("keywords") and discarding whitespace. Formally, a lexer recognizes some set of Regular languages.


1 Answers

(This explains the idea hinted by Spudd86).

You need to implement a finite state machine. There are the following states:

  • General state
  • Inside a file name
  • Inside the && token
  • Inside the || token

For each state and next input character, you have to decide what is the next state, and whether to output a token. For example:

  • Current state: General; character: x => next state: inside-file-name
  • Current state: inside-file-name; character: space => next state: General; output the token
  • Current state: inside-file-name; character: & => next state: inside-&&; output the token
  • Current state: inside-&&; character: & => next state: General; output the token
  • Current state: inside-&&; character: x => next state: General; syntax error
  • ... (ad nauseum)

It's much boring work to work out all the rules (the fun starts when you must debug the resulting code), so most people use code generators to do that.


Edit: some code (sorry if the syntax is messed-up; i usually program in C++)

enum state {
    STATE_GENERAL,
    STATE_IN_FILENAME,
    ...
};

// Many characters are treated the same (e.g. 'x' and 'y') - so use categories
enum character_category
{
    CHAR_GENERAL, // can appear in filenames
    CHAR_WHITESPACE = ' ',
    CHAR_AMPERSAND = '&',
    CHAR_PIPE = '|',
    CHAR_EOF = EOF,
    ...
};

character_category translate(int c)
{
    switch (c) {
    case '&': return CHAR_AMPERSAND;
    case ' ': case '\t': case '\n': return CHAR_WHITESPACE;
    ...
    default: return CHAR_GENERAL;
    }
}

void do_stuff()
{
    character_category cat;
    state current_state = STATE_GENERAL;
    state next_state;
    char token[100];
    char token_length = 0;
    do {
        int c = getchar();
        cat = translate(c);
        // The following implements a switch on 2 variables
        int selector = 1000 * current_state + cat;

        switch (selector)
        {
        case 1000 * STATE_GENERAL + CHAR_GENERAL:
            next_state = STATE_IN_FILENAME;
            token[token_length++] = c; // append a character to a filename token
            break;

        case 1000 * STATE_GENERAL + CHAR_WHITESPACE:
            next_state = STATE_GENERAL; // do nothing
            break;

        case 1000 * STATE_GENERAL + CHAR_PIPE:
            next_state = STATE_IN_OR_TOKEN; // the first char in '||' or just '|'
            break;

        // Much repetitive code already; define a macro for the case constants?
        // Have to cover all states and all character categories; good luck...

        case 1000 * STATE_IN_FILENAME + EOF:
        case 1000 * STATE_IN_FILENAME + CHAR_WHITESPACE:
            next_state = STATE_GENERAL;
            printf("Filename token: %s\n", token);
            break;

        default:
            printf("Bug\n"); // forgot one of the cases?
        }

        current_state = next_state;

    } while (cat != CHAR_EOF);
}
like image 80
anatolyg Avatar answered Oct 18 '22 03:10

anatolyg