I have been trying to write a simple lexical analyzer in java .
The File Token.java looks as follows :
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public enum Token {
TK_MINUS ("-"),
TK_PLUS ("\\+"),
TK_MUL ("\\*"),
TK_DIV ("/"),
TK_NOT ("~"),
TK_AND ("&"),
TK_OR ("\\|"),
TK_LESS ("<"),
TK_LEG ("<="),
TK_GT (">"),
TK_GEQ (">="),
TK_EQ ("=="),
TK_ASSIGN ("="),
TK_OPEN ("\\("),
TK_CLOSE ("\\)"),
TK_SEMI (";"),
TK_COMMA (","),
TK_KEY_DEFINE ("define"),
TK_KEY_AS ("as"),
TK_KEY_IS ("is"),
TK_KEY_IF ("if"),
TK_KEY_THEN ("then"),
TK_KEY_ELSE ("else"),
TK_KEY_ENDIF ("endif"),
OPEN_BRACKET ("\\{"),
CLOSE_BRACKET ("\\}"),
DIFFERENT ("<>"),
STRING ("\"[^\"]+\""),
INTEGER ("\\d"),
IDENTIFIER ("\\w+");
private final Pattern pattern;
Token(String regex) {
pattern = Pattern.compile("^" + regex);
}
int endOfMatch(String s) {
Matcher m = pattern.matcher(s);
if (m.find()) {
return m.end();
}
return -1;
}
}
The Lexer is as follows : Lexer.java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;
public class Lexer {
private StringBuilder input = new StringBuilder();
private Token token;
private String lexema;
private boolean exausthed = false;
private String errorMessage = "";
private Set<Character> blankChars = new HashSet<Character>();
public Lexer(String filePath) {
try (Stream<String> st = Files.lines(Paths.get(filePath))) {
st.forEach(input::append);
} catch (IOException ex) {
exausthed = true;
errorMessage = "Could not read file: " + filePath;
return;
}
blankChars.add('\r');
blankChars.add('\n');
blankChars.add((char) 8);
blankChars.add((char) 9);
blankChars.add((char) 11);
blankChars.add((char) 12);
blankChars.add((char) 32);
moveAhead();
}
public void moveAhead() {
if (exausthed) {
return;
}
if (input.length() == 0) {
exausthed = true;
return;
}
ignoreWhiteSpaces();
if (findNextToken()) {
return;
}
exausthed = true;
if (input.length() > 0) {
errorMessage = "Unexpected symbol: '" + input.charAt(0) + "'";
}
}
private void ignoreWhiteSpaces() {
int charsToDelete = 0;
while (blankChars.contains(input.charAt(charsToDelete))) {
charsToDelete++;
}
if (charsToDelete > 0) {
input.delete(0, charsToDelete);
}
}
private boolean findNextToken() {
for (Token t : Token.values()) {
int end = t.endOfMatch(input.toString());
if (end != -1) {
token = t;
lexema = input.substring(0, end);
input.delete(0, end);
return true;
}
}
return false;
}
public Token currentToken() {
return token;
}
public String currentLexema() {
return lexema;
}
public boolean isSuccessful() {
return errorMessage.isEmpty();
}
public String errorMessage() {
return errorMessage;
}
public boolean isExausthed() {
return exausthed;
}
}
And can be tested with a Try.java as follows :
public class Try {
public static void main(String[] args) {
Lexer lexer = new Lexer("C:/Users/Input.txt");
System.out.println("Lexical Analysis");
System.out.println("-----------------");
while (!lexer.isExausthed()) {
System.out.printf("%-18s : %s \n",lexer.currentLexema() , lexer.currentToken());
lexer.moveAhead();
}
if (lexer.isSuccessful()) {
System.out.println("Ok! :D");
} else {
System.out.println(lexer.errorMessage());
}
}
}
Say the Input.txt has
define mine
a=1000;
b=23.5;
The output I expect is
define : TK_KEYWORD
mine : IDENTIFIER
a : IDENTIFIER
= : TK_ASSIGN
1000 : INTEGER
; : TK_SEMI
b : IDENTIFIER
= : TK_ASSIGN
23.5 : REAL
But The issue I am facing is : It treats each digit like
1 INTEGER
0 INTEGER
0 INTEGER
0 INTEGER
also it doesn't recognize Real numbers . I get:
Unexpected symbol: '.'
What are the changes needed to get the expected results?
The lexical analyzer is responsible for removing the white spaces and comments from the source program. It corresponds to the error messages with the source program. It helps to identify the tokens. The input characters are read by the lexical analyzer from the source code.
A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming languages. Example of tokens: Type token (id, number, real, . . . ) Punctuation tokens (IF, void, return, . . . )
In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of lexical tokens (strings with an assigned and thus identified meaning).
A Java lexer and parser generator is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar. This tutorial focuses specifically on lexers. Lexical analysis, or "lexing", is the process of converting a sequence of characters into a sequence of tokens.
Your pattern to match integer is:
INTEGER ("\\d"),
That matches exactly one digit.
If you want more than one, go for
INTEGER ("\\d+"),
for example.
And, just for completion, the missing other pattern for floating point numbers could look like
REAL ("(\\d+)\\.\\d+")
as the comments pointed out. Or
REAL ("(\\d*)\\.\\d+")
to allow for
.23
too - if that is what you are looking for!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With