
Flex / Lex Encoding Strings with Escaped Characters

I'll refer to this question for some of the background:

Regular expression for a string literal in flex/lex

I am having trouble handling input that contains escaped characters in my lexer. I suspect the problem is related to how the string is encoded, but I'm not sure.

Here is how I am handling string literals in my lexer:

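\"(\\.|[^\\"])*\"   {
    /* Copy the matched text without the surrounding quotes. */
    char* text1 = strndup(yytext + 1, strlen(yytext) - 2);
    /* A C string literal for comparison. */
    const char* text2 = "text\n";

    /* Note: %x with a pointer is not portable; %p with a (void *) cast is. */
    printf("value = <%s> <%x>\n", text1, text1);
    printf("value = <%s> <%x>\n", text2, text2);
    free(text1);
}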

This outputs the following:

value = <text\n"> <15a1bb0>
value = <text
> <7ac871>

It appears to be treating the newline character separately as a backslash followed by an n.
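
One way to confirm this is to compare the bytes of the two strings. The following is a minimal sketch in plain C. The helper `hex_bytes` is hypothetical and not part of the lexer, and the text the lexer sees is simulated by a literal with an escaped backslash:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of s into out as space-separated
   two-digit hex values. Stops early if out would overflow. */
static void hex_bytes(const char *s, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return;
    out[0] = '\0';

    /* Each step writes at most a space, two hex digits and a terminator. */
    for (; *s != '\0' && used + 4 <= outsize; s++) {
        used += (size_t)snprintf(out + used, outsize - used, "%s%02x",
                                 used ? " " : "", (unsigned char)*s);
    }
}
```

Calling `hex_bytes("text\\n", buf, sizeof buf)` gives `74 65 78 74 5c 6e` (a backslash followed by `n`), while `hex_bytes("text\n", buf, sizeof buf)` gives `74 65 78 74 0a` (a single newline byte).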

What's going on here, and how do I process the text so that it is identical to the C string literal?

asked Mar 24 '11 11:03 by Dan


1 Answer

Your regexp only matches the backslash escapes inside the string; it doesn't translate them into the characters that they represent. I prefer to handle this sort of thing with a flex start state and a string-building buffer that accumulates characters. Something like:

%{
#include <stdlib.h>   /* strtol */
#include <string.h>   /* strdup */
/* StringBuffer and the *Buffer* helpers are assumed to be provided elsewhere. */
static StringBuffer strbuf;
%}
%x string
%%

\"                   { BEGIN string; ClearBuffer(strbuf); }
<string>[^\\"\n]+    { AppendBufferString(strbuf, yytext); }
<string>\\n          { AppendBufferChar(strbuf, '\n'); }
<string>\\t          { AppendBufferChar(strbuf, '\t'); }
<string>\\[0-7]{1,3} { AppendBufferChar(strbuf, strtol(yytext+1, 0, 8)); }
<string>\\[\\"]      { AppendBufferChar(strbuf, yytext[1]); }
<string>\"           { yylval.str = strdup(BufferData(strbuf)); BEGIN INITIAL; return STRING; }
<string>\\.          { error("bogus escape '%s' in string\n", yytext); }
<string>\n           { error("newline in string\n"); }

This makes what is going on much clearer, makes it easy to add new escape processing code for new escapes, and makes it easy to issue clear error messages when something goes wrong.
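
The lexer above does not define `StringBuffer` or its helpers. As a rough sketch of what they might look like, here is one possible growable buffer in C. Everything in it is an assumption made for illustration: the struct layout, the `sb_*` function names, and the choice to make `ClearBuffer`, `AppendBufferChar`, `AppendBufferString` and `BufferData` macros so that the calls compile exactly as written with a `static StringBuffer strbuf;` object.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string buffer; a zero-initialized value is valid. */
typedef struct {
    char  *data;   /* NUL-terminated contents, or NULL before first use */
    size_t len;    /* bytes used, not counting the terminator */
    size_t cap;    /* bytes allocated */
} StringBuffer;

/* Ensure room for `extra` more bytes plus the terminator; abort on OOM. */
static void sb_reserve(StringBuffer *b, size_t extra)
{
    size_t need = b->len + extra + 1;
    size_t cap;
    char *p;

    if (need <= b->cap)
        return;
    cap = b->cap ? b->cap : 32;
    while (cap < need)
        cap *= 2;
    p = realloc(b->data, cap);
    if (p == NULL)
        abort();
    b->data = p;
    b->cap = cap;
}

/* Reset the buffer to the empty string, keeping its allocation. */
static void sb_clear(StringBuffer *b)
{
    sb_reserve(b, 0);
    b->len = 0;
    b->data[0] = '\0';
}

/* Append one character and keep the contents NUL-terminated. */
static void sb_append_char(StringBuffer *b, char c)
{
    sb_reserve(b, 1);
    b->data[b->len++] = c;
    b->data[b->len] = '\0';
}

/* Append a NUL-terminated string. */
static void sb_append_string(StringBuffer *b, const char *s)
{
    size_t n = strlen(s);

    sb_reserve(b, n);
    memcpy(b->data + b->len, s, n + 1);
    b->len += n;
}

/* Macros so the lexer can pass the buffer object itself, not a pointer. */
#define ClearBuffer(b)           sb_clear(&(b))
#define AppendBufferChar(b, c)   sb_append_char(&(b), (char)(c))
#define AppendBufferString(b, s) sb_append_string(&(b), (s))
#define BufferData(b)            ((b).data)
```

With these definitions placed in the `%{ ... %}` block before `strbuf` is declared, the rules compile unchanged. Note that an escape such as `\0` stores an embedded NUL, so `strdup(BufferData(strbuf))` would copy only the text before it.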

answered Oct 31 '22 10:10 by Chris Dodd