Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using flex in c and regular expressions

I am trying to create a lexical analyzer for a compiler.But I have a problem using regular expressions to find things like keywords and real numbers.. for example some definitions :

id         [aA-zZ][aA-zZ-0-9_]* 

keyword    if|else|when|while

integer    [0-9]+

real       integer\.integer

..There are some problems though,the analyzer cant get a keyword for example instead if i give the word 'else' it sees it as a id(I get a warning like rule cannot be matched too.

Also if I try to give a real number for example 1.2 the linker sees it as integer delimeter integer instead as a real.I am not good at regural expressions language though,..I thought for the real/integer distinction to put a rule like ("Read a number that doesn't end with . and it's an integer , else it's a number") but how can I put that in reg. language.

like image 304
user2241915 Avatar asked Feb 06 '15 14:02

user2241915


People also ask

What is flex and regex?

RE/flex (regex-centric, fast lexical analyzer) is a free and open source computer program written in C++ that generates fast lexical analyzers (also known as "scanners" or "lexers") in C++.

Can you do regex in C?

A regular expression is a sequence of characters used to match a pattern to a string. The expression can be used for searching text and validating input. Remember, a regular expression is not the property of a particular language. POSIX is a well-known library used for regular expressions in C.

What is difference [] and () in regex?

[] denotes a character class. () denotes a capturing group. [a-z0-9] -- One character that is in the range of a-z OR 0-9. (a-z0-9) -- Explicit capture of a-z0-9 .


2 Answers

The reason Flex isn't detecting the desired strings is there are ambiguities when reading input. The word 'else' corresponds both to regex "else" and [a-zA-Z][a-zA-Z0-9]*, but the last one was written first so Flex decides input must fit this regular expression.

Flex is a lexical analyzer that reads input file and check the current string with any of regular expressions you want it to check for. You must always remember two fundamental rules driving its string-matching system:

  • Longest Best Match;
  • Priority rules;

'Longest Best Match' tells Flex that read input must match the longest pattern possible; LBM rules work better with priority.

Priority rules tell Flex that if input match more that one regular expression, only the first one must be considered.

When writing code, keywords must be considered as keywords and not as IDs, then you should write keywords' regex before anything else: if something doesn't match 'keyword' regular expression, it means we are reading something else, like an id or the name of a function.

I would write something like this:

keyword     "if"|"else"|"when"|"while"
id          [a-zA-Z][a-zA-Z0-9_-]*
nat         [0-9]|([1-9][0-9]*)
integer     [+|-]*{nat}
real        {integer}[.]{integer}

Keyword first; if we aren't reading a keyword, it could be an id; if there are only numbers, let's see what kind of number it is. Priority rules will let Flex read 'else', 'if' and other keywords as 'keyword' tokens, and Longest best match will make sure a number like 1.2 will be considered a 'real' instead of two 'nat' separated by a dot.

like image 190
liggiorgio Avatar answered Oct 29 '22 15:10

liggiorgio


When the same lexeme can be matched by more than one rule, lex chooses the one that appears first in the file. So your rule for keywords must come before that for identifiers.

The problem with the rule integer.integer isn't that it can't be disambiguated from your integer rule - in fact it can be disambiguated trivially as there is no overlap at all. The problem is that it matches the string "integer", followed by any character, followed by the string "integer". You can't reference other rules within a rule. You could create a definition and reference that using curly braces or you could just write [0-9]+ '.' [0-9]+.

like image 37
sepp2k Avatar answered Oct 29 '22 13:10

sepp2k