Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Lexical Analyser In Java

Tags:

java

lexer

I have been trying to write a simple lexical analyzer in java .

The File Token.java looks as follows :

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public enum Token {

    TK_MINUS ("-"), 
    TK_PLUS ("\\+"), 
    TK_MUL ("\\*"), 
    TK_DIV ("/"), 
    TK_NOT ("~"), 
    TK_AND ("&"),  
    TK_OR ("\\|"),  
    TK_LESS ("<"),
    TK_LEG ("<="),
    TK_GT (">"),
    TK_GEQ (">="), 
    TK_EQ ("=="),
    TK_ASSIGN ("="),
    TK_OPEN ("\\("),
    TK_CLOSE ("\\)"), 
    TK_SEMI (";"), 
    TK_COMMA (","), 
    TK_KEY_DEFINE ("define"), 
    TK_KEY_AS ("as"),
    TK_KEY_IS ("is"),
    TK_KEY_IF ("if"), 
    TK_KEY_THEN ("then"), 
    TK_KEY_ELSE ("else"), 
    TK_KEY_ENDIF ("endif"),
    OPEN_BRACKET ("\\{"),
    CLOSE_BRACKET ("\\}"),
    DIFFERENT ("<>"),

    STRING ("\"[^\"]+\""),
    INTEGER ("\\d"), 
    IDENTIFIER ("\\w+");

    private final Pattern pattern;

    Token(String regex) {
        pattern = Pattern.compile("^" + regex);
    }

    int endOfMatch(String s) {
        Matcher m = pattern.matcher(s);

        if (m.find()) {
            return m.end();
        }
        return -1;
    }
}

The Lexer is as follows : Lexer.java

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class Lexer {
    private StringBuilder input = new StringBuilder();
    private Token token;
    private String lexema;
    private boolean exausthed = false;
    private String errorMessage = "";
    private Set<Character> blankChars = new HashSet<Character>();

    public Lexer(String filePath) {
        try (Stream<String> st = Files.lines(Paths.get(filePath))) {
            st.forEach(input::append);
        } catch (IOException ex) {
            exausthed = true;
            errorMessage = "Could not read file: " + filePath;
            return;
        }

        blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);

        moveAhead();
    }

    public void moveAhead() {
        if (exausthed) {
            return;
        }

        if (input.length() == 0) {
            exausthed = true;
            return;
        }

        ignoreWhiteSpaces();

        if (findNextToken()) {
            return;
        }

        exausthed = true;

        if (input.length() > 0) {
            errorMessage = "Unexpected symbol: '" + input.charAt(0) + "'";
        }
    }

    private void ignoreWhiteSpaces() {
        int charsToDelete = 0;

        while (blankChars.contains(input.charAt(charsToDelete))) {
            charsToDelete++;
        }

        if (charsToDelete > 0) {
            input.delete(0, charsToDelete);
        }
    }

    private boolean findNextToken() {
        for (Token t : Token.values()) {
            int end = t.endOfMatch(input.toString());

            if (end != -1) {
                token = t;
                lexema = input.substring(0, end);
                input.delete(0, end);
                return true;
            }
        }

        return false;
    }

    public Token currentToken() {
        return token;
    }

    public String currentLexema() {
        return lexema;
    }

    public boolean isSuccessful() {
        return errorMessage.isEmpty();
    }

    public String errorMessage() {
        return errorMessage;
    }

    public boolean isExausthed() {
        return exausthed;
    }
}

And can be tested with a Try.java as follows :

public class Try {

    public static void main(String[] args) {

        Lexer lexer = new Lexer("C:/Users/Input.txt");

        System.out.println("Lexical Analysis");
        System.out.println("-----------------");
        while (!lexer.isExausthed()) {
            System.out.printf("%-18s :  %s \n",lexer.currentLexema() , lexer.currentToken());
            lexer.moveAhead();
        }

        if (lexer.isSuccessful()) {
            System.out.println("Ok! :D");
        } else {
            System.out.println(lexer.errorMessage());
        }
    }
}

Say the Input.txt has

define mine 
a=1000;
b=23.5;

The output I expect is

define : TK_KEYWORD
mine : IDENTIFIER
a : IDENTIFIER
= : TK_ASSIGN
1000 : INTEGER
; : TK_SEMI
b : IDENTIFIER
= : TK_ASSIGN
23.5 : REAL

But The issue I am facing is : It treats each digit like

1 INTEGER
0 INTEGER
0 INTEGER
0 INTEGER

also it doesn't recognize Real numbers . I get:

Unexpected symbol: '.'

What are the changes needed to get the expected results?

like image 634
Vicky Avatar asked Mar 28 '17 11:03

Vicky


People also ask

What is a lexical analyzer used for?

The lexical analyzer is responsible for removing the white spaces and comments from the source program. It corresponds to the error messages with the source program. It helps to identify the tokens. The input characters are read by the lexical analyzer from the source code.

What is lexical analysis example?

A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming languages. Example of tokens: Type token (id, number, real, . . . ) Punctuation tokens (IF, void, return, . . . )

What is lexical analysis?

In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of lexical tokens (strings with an assigned and thus identified meaning).

What is a lexer Java?

A Java lexer and parser generator is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar. This tutorial focuses specifically on lexers. Lexical analysis, or "lexing", is the process of converting a sequence of characters into a sequence of tokens.


1 Answers

Your pattern to match integer is:

INTEGER ("\\d"), 

That matches exactly one digit.

If you want more than one, go for

INTEGER ("\\d+"), 

for example.

And, just for completion, the missing other pattern for floating point numbers could look like

REAL ("(\\d+)\\.\\d+")

as the comments pointed out. Or

REAL ("(\\d*)\\.\\d+")

to allow for

.23

too - if that is what you are looking for!

like image 192
GhostCat Avatar answered Oct 29 '22 09:10

GhostCat