I have just started with Flex and can't seem to figure out how to match the following Expression :
"Dog".*"Cat"
------------------
Input :
Dog Ca Cat Cc Cat
------------------
Output:
Dog Ca Cat Cc Cat
But I want a non-greedy matching, with the following output :
Output:
Dog Ca Cat
How can this be acheived on Flex ?
EDIT
Tried the following :
%%
Dog.*Cat/.*Cat printf("Matched : ||%s||", yytext);
dog.*cat printf("Matched : ||%s||", yytext);
dOg[^c]*cAt printf("Matched : ||%s||", yytext);
DOG.*?CAT printf("Matched : ||%s||", yytext);
%%
Input :
Dog Ca Cat Cc Cat
dog Ca cat Cc cat
dOg Ca cAt Cc cAt
DOG CA CAT CC CAT
Output :
Matched : ||Dog Ca Cat Cc Cat||
Matched : ||dog Ca cat Cc cat||
Matched : ||dOg Ca cAt|| Cc cAt
Matched : ||DOG CA CAT CC CAT||
Also receiving a warning :
lex4.l:2: warning, dangerous trailing context
Flex Version :
flex 2.5.35 Apple(flex-31)
This is quite a common issue with using the lex/flex tools that stumps beginners (and sometime non-beginners). There are two solutions to the problem that require two different advanced features of the tools. A phrase like dog ... cat
is much the same problem as matching comments in various programming languages, such as the C comment form /* ... */
or even 'comment' ... 'tnemmoc'
. These have exactly the same characteristics as your example. Consider the following C code:
/* This is a comment */ "This is a String */"
A greedy lexical match of that would match the wrong comment terminator (and is a good test of a student lexer BTW!).
There are suggested solutions on several university compiler courses. The one that explains it well is here (at Manchester). Which cites a couple of good books which also cover the problems:
The two techniques described are to use Start Conditions to explicity specify the state machine, or manual input to read characters directly.
For your cat ... dog
problem they can be programmed in the following ways:
In this solution we need several states. The keyword dog
causes causes it to enter the DOG
state which continues until a letter c
is encountered. This then enters the LETTERC
state which must be followed by a letter a
, if not the DOG
state continues; a letter a
causes the CAT
state to be entered which must be followed by a letter t
which causes the entire phrase to be matched and returns to the INITIAL
state. The yymore
causes the entire dog ... cat
text to be retained for use.
%x DOG LETTERC CAT
d [dD]
o [oO]
g [gG]
c [cC]
a [aA]
t [tT]
ws [ \t\r\n]+
%%
<INITIAL>{d}{o}{g} {
BEGIN(DOG);
printf("DOG\n");
yymore();
}
<DOG>[^cC]*{c} {
printf("C: %s\n",yytext);
yymore();
BEGIN(LETTERC);
}
<LETTERC>{a} {
printf("A: %s\n",yytext);
yymore();
BEGIN(CAT);
}
<LETTERC>[^aA] {
BEGIN(DOG);
yymore();
}
<CAT>{t} {
printf("CAT: %s\n",yytext);
BEGIN(INITIAL);
}
<CAT>[^tT] {
BEGIN(DOG);
yymore();
}
<INITIAL>{ws} /* skip */ ;
The Manual input method just matches the start phrase dog
and the enters C code which swallows up input characters until the desired cat
sequence is encountered. (I did not bother with both upper and lower case letters). The problem with this solution is that it is hard to retain the input text value in yytext for later use in the parser. It discards it, which would be OK if the construct is a comment, but no so useful otherwise.
d [dD]
o [oO]
g [gG]
ws [ \t\r\n]+
%%
{d}{o}{g} {
register int c;
for ( ; ; )
{
/* Not dealt with upper case .. left as an exercise */
while ( (c = input()) != 'c' &&
c != EOF )
; /* eat up text of dog */
if ( c == 'c' )
{
if ( ( c = input()) == 'a' )
if ( (c = input()) == 't' )
break; /* found the end */
}
if ( c == EOF )
{
REJECT;
break;
}
}
/* because we have used input() yytext always contains "dog" */
printf("cat: %s\n", yytext);
}
{ws} /* skip */ ;
(Both these solutions have been tested)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With