I'm working on a shell, a small bash-like shell, without scripting (if while ...) I have to make the lexer/parser (LL) by hand.
So the lexer will transform the command (char *cmd) to a linked list (t_list *list). And the LL parser will transform the linked list (t_list *list) to an AST (binary tree t_btree *root) with a grammar
So, I know how to make the LL parser but I don't know how to tokenize my command.
For example: ps | grep ls >> file ; make && ./a.out
=> 'ps' '|' 'grep' 'ls' '>>' 'file' ';' ''make '&&' './a.out'
Thanks.
(I don't wanna use any generator)
The program within the compiler which is responsible for doing the work of scanning and evaluating is often referred to as the lexer or the tokenizer, and the entire lexical analysis phase is sometimes called the process of lexing or tokenizing.
Overview. The lexer is contained in the file lex.cc . It is a hand-coded lexer, and not implemented as a state machine. It can understand C, C++ and Objective-C source code, and has been extended to allow reasonably successful preprocessing of assembly language.
The lexer just turns the meaningless string into a flat list of things like "number literal", "string literal", "identifier", or "operator", and can do things like recognizing reserved identifiers ("keywords") and discarding whitespace. Formally, a lexer recognizes some set of Regular languages.
(This explains the idea hinted by Spudd86).
You need to implement a finite state machine. There are the following states:
&&
token||
tokenFor each state and next input character, you have to decide what is the next state, and whether to output a token. For example:
It's much boring work to work out all the rules (the fun starts when you must debug the resulting code), so most people use code generators to do that.
Edit: some code (sorry if the syntax is messed-up; i usually program in C++)
enum state {
STATE_GENERAL,
STATE_IN_FILENAME,
...
};
// Many characters are treated the same (e.g. 'x' and 'y') - so use categories
enum character_category
{
CHAR_GENERAL, // can appear in filenames
CHAR_WHITESPACE = ' ',
CHAR_AMPERSAND = '&',
CHAR_PIPE = '|',
CHAR_EOF = EOF,
...
};
character_category translate(int c)
{
switch (c) {
case '&': return CHAR_AMPERSAND;
case ' ': case '\t': case '\n': return CHAR_WHITESPACE;
...
default: return CHAR_GENERAL;
}
}
void do_stuff()
{
character_category cat;
state current_state = STATE_GENERAL;
state next_state;
char token[100];
char token_length = 0;
do {
int c = getchar();
cat = translate(c);
// The following implements a switch on 2 variables
int selector = 1000 * current_state + cat;
switch (selector)
{
case 1000 * STATE_GENERAL + CHAR_GENERAL:
next_state = STATE_IN_FILENAME;
token[token_length++] = c; // append a character to a filename token
break;
case 1000 * STATE_GENERAL + CHAR_WHITESPACE:
next_state = STATE_GENERAL; // do nothing
break;
case 1000 * STATE_GENERAL + CHAR_PIPE:
next_state = STATE_IN_OR_TOKEN; // the first char in '||' or just '|'
break;
// Much repetitive code already; define a macro for the case constants?
// Have to cover all states and all character categories; good luck...
case 1000 * STATE_IN_FILENAME + EOF:
case 1000 * STATE_IN_FILENAME + CHAR_WHITESPACE:
next_state = STATE_GENERAL;
printf("Filename token: %s\n", token);
break;
default:
printf("Bug\n"); // forgot one of the cases?
}
current_state = next_state;
} while (cat != CHAR_EOF);
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With