Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Non-Greedy Regular Expression Matching in Flex

I have just started with Flex and can't seem to figure out how to match the following Expression :

"Dog".*"Cat"
------------------
Input :
Dog Ca Cat Cc Cat
------------------
Output:
Dog Ca Cat Cc Cat

But I want a non-greedy matching, with the following output :

Output:
Dog Ca Cat

How can this be acheived on Flex ?

EDIT

Tried the following :

%%
Dog.*Cat/.*Cat  printf("Matched : ||%s||", yytext);
dog.*cat        printf("Matched : ||%s||", yytext);
dOg[^c]*cAt     printf("Matched : ||%s||", yytext);
DOG.*?CAT       printf("Matched : ||%s||", yytext);
%%

Input :

Dog Ca Cat Cc Cat
dog Ca cat Cc cat
dOg Ca cAt Cc cAt
DOG CA CAT CC CAT

Output :

Matched : ||Dog Ca Cat Cc Cat||
Matched : ||dog Ca cat Cc cat||
Matched : ||dOg Ca cAt|| Cc cAt
Matched : ||DOG CA CAT CC CAT||

Also receiving a warning :

lex4.l:2: warning, dangerous trailing context

Flex Version :

flex 2.5.35 Apple(flex-31)
like image 923
Kyuubi Avatar asked Oct 21 '13 07:10

Kyuubi


1 Answers

This is quite a common issue with using the lex/flex tools that stumps beginners (and sometime non-beginners). There are two solutions to the problem that require two different advanced features of the tools. A phrase like dog ... cat is much the same problem as matching comments in various programming languages, such as the C comment form /* ... */ or even 'comment' ... 'tnemmoc'. These have exactly the same characteristics as your example. Consider the following C code:

/* This is a comment */ "This is a String */"

A greedy lexical match of that would match the wrong comment terminator (and is a good test of a student lexer BTW!).

There are suggested solutions on several university compiler courses. The one that explains it well is here (at Manchester). Which cites a couple of good books which also cover the problems:

  • J.Levine, T.Mason & D.Brown: Lex and Yacc (2nd ed.)
  • M.E.Lesk & E.Schmidt: Lex - A Lexical Analyzer Generator

The two techniques described are to use Start Conditions to explicity specify the state machine, or manual input to read characters directly.

For your cat ... dog problem they can be programmed in the following ways:

Start Conditions

In this solution we need several states. The keyword dog causes causes it to enter the DOG state which continues until a letter c is encountered. This then enters the LETTERC state which must be followed by a letter a, if not the DOG state continues; a letter a causes the CAT state to be entered which must be followed by a letter t which causes the entire phrase to be matched and returns to the INITIAL state. The yymore causes the entire dog ... cat text to be retained for use.

%x DOG LETTERC CAT
d [dD]
o [oO]
g [gG]
c [cC]
a [aA]
t [tT]
ws [ \t\r\n]+
%%

<INITIAL>{d}{o}{g} {
        BEGIN(DOG);
        printf("DOG\n");
        yymore();
        }
<DOG>[^cC]*{c} {
        printf("C: %s\n",yytext);
        yymore();
        BEGIN(LETTERC);
        }
<LETTERC>{a} {
       printf("A: %s\n",yytext);
       yymore();
       BEGIN(CAT);
      }
<LETTERC>[^aA] {
        BEGIN(DOG);
        yymore();
        }
<CAT>{t} {
        printf("CAT: %s\n",yytext);
        BEGIN(INITIAL);
        }
<CAT>[^tT] {
        BEGIN(DOG);
        yymore();
        }
<INITIAL>{ws}  /* skip */ ;

Manual Input

The Manual input method just matches the start phrase dog and the enters C code which swallows up input characters until the desired cat sequence is encountered. (I did not bother with both upper and lower case letters). The problem with this solution is that it is hard to retain the input text value in yytext for later use in the parser. It discards it, which would be OK if the construct is a comment, but no so useful otherwise.

d [dD]
o [oO]
g [gG]
ws [ \t\r\n]+
%%
{d}{o}{g}   {
   register int c;

                     for ( ; ; )
                         {
                         /* Not dealt with upper case .. left as an exercise */
                         while ( (c = input()) != 'c' &&
                                 c != EOF )
                             ;    /* eat up text of dog */

                         if ( c == 'c' )
                             {
                              if ( ( c = input()) == 'a' )
                                     if ( (c = input()) == 't' )
                                 break;    /* found the end */
                             }
                        if ( c == EOF )
                             {
                             REJECT;
                             break;
                             }
                         }
            /* because we have used input() yytext always contains "dog" */
            printf("cat: %s\n", yytext);
       }
{ws}  /* skip */ ;

(Both these solutions have been tested)

like image 120
Brian Tompsett - 汤莱恩 Avatar answered Sep 21 '22 15:09

Brian Tompsett - 汤莱恩