
How are backslash escape sequences implemented in compilers?

I would like to know how backslash escape sequences are implemented in compilers. If we write "\n" in a string, how does the compiler replace it with a newline character? How does it replace "\b" with a backspace character?

I ask because I wrote the code:

#include<stdio.h>
main()
{
    printf("Hello \c");
}

The output was:

Hello 
Exited: ExitFailure 7 

I ran it on codepad while working through exercise 1-2 of the K&R book.

Thanks in Advance

Rasmi Ranjan Nayak asked Dec 26 '11 07:12


3 Answers

To understand this, you have to understand a little about how compilers work in general. The first step a compiler generally takes is called lexical analysis (or lexing for short). Lexical analysis is when the compiler takes the input code and breaks it into pieces it can recognize, called tokens. To do this, it usually uses regular expressions to describe the different pieces. One of the pieces it recognizes is a string literal, which is a quoted string like "Hello". A regular expression for a string literal usually looks something like "([^"\\]|\\"|\\\\|\\n|\\b)*" (inside the pattern, a literal backslash is itself written as \\). In English, this means a sequence of characters which starts with a double quote and ends with a double quote, and in between has either 1) any character which isn't a double quote or a backslash, 2) a backslash and then a double quote, 3) a backslash and then another backslash, 4) a backslash and then an n, or 5) a backslash and then a b. This middle pattern is repeated zero or more times. (Note: in real compilers, the list of characters which can occur after the backslash is generally longer.) Looking for this pattern allows the lexer to find string literals.
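
The recognition step described above can be sketched by hand, without a regular expression library. This is only an illustration: the function name scan_string_literal is made up for this example, it accepts just the four escapes mentioned in this answer, and it rejects anything else, whereas a real compiler knows more escapes and may only warn about unknown ones.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: if src begins with a string literal, return the
 * literal's length in characters, counting both quotes. Return 0 if src
 * does not begin with a well-formed literal.
 * Only the escapes \" \\ \n \b are accepted (an assumption of this sketch).
 */
size_t scan_string_literal(const char *src)
{
    size_t i = 0;

    if (src[i] != '"')
        return 0;                 /* must open with a double quote */
    i++;

    for (;;) {
        char c = src[i];

        if (c == '\0')
            return 0;             /* input ended before the closing quote */
        if (c == '"')
            return i + 1;         /* closing quote: the literal is complete */
        if (c == '\\') {
            char e = src[i + 1];  /* character after the backslash */
            if (e == '\0' || strchr("\"\\nb", e) == NULL)
                return 0;         /* unknown escape, such as \c */
            i += 2;               /* consume the backslash and its partner */
        } else {
            i++;                  /* ordinary character */
        }
    }
}
```

Given source text that starts with "Hello" followed by other code, the function reports a length of 7 (five letters plus the two quotes), so the lexer knows where the literal ends and the next token begins.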

Then, once the string literal has been identified, to find out what string to actually put in memory, it has to do a second layer of processing which is to go through the string literal and handle the backslashes. It just reads from the start to the end, looking for backslash sequences. Each of the backslash sequences is replaced with a different character. \" becomes ". \\ becomes \. \n becomes a newline. \b becomes a backspace character, and so forth. To figure out which to put where, it just uses a table which shows what to put in place for that sequence.
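
A minimal sketch of that second pass, assuming the text between the quotes has already been extracted. The names escape_table and decode_escapes are invented for this example, and the table holds only the four escapes discussed here; a real compiler's table is longer and its error handling differs.

```c
#include <stddef.h>

/* One row per escape: the letter after the backslash and what it stands for. */
struct escape {
    char letter;
    char value;
};

static const struct escape escape_table[] = {
    { '"',  '"'  },
    { '\\', '\\' },
    { 'n',  '\n' },
    { 'b',  '\b' },
};

/*
 * Hypothetical helper: copy the body of a string literal from src to dst,
 * replacing each escape sequence with the character found in escape_table.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the decoded length, or (size_t)-1 if an escape is not in the table.
 */
size_t decode_escapes(const char *src, char *dst)
{
    const size_t n = sizeof escape_table / sizeof escape_table[0];
    size_t out = 0;
    size_t k;

    while (*src != '\0') {
        if (*src != '\\') {
            dst[out++] = *src++;      /* ordinary character: copy as is */
            continue;
        }
        src++;                        /* skip the backslash */
        for (k = 0; k < n; k++) {
            if (escape_table[k].letter == *src)
                break;
        }
        if (k == n)
            return (size_t)-1;        /* unknown escape or trailing backslash */
        dst[out++] = escape_table[k].value;
        src++;
    }
    dst[out] = '\0';
    return out;
}
```

Supporting another escape in this sketch means adding one row to the table, which is why a lookup table is a convenient shape for this step.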

Keith Irwin answered Oct 12 '22 08:10


The classic explanation is given in the famous article by Ken Thompson called 'Reflections on Trusting Trust' (also available from many other sources, including the book ACM Turing Award Lectures: The First Twenty Years 1966-1985) which was his acceptance speech when he received the ACM Turing Award along with Dennis Ritchie.

Amongst other things, it describes how to add \v to a compiler that does not recognize it:

C allows a string construct to specify an initialized character array. The individual characters in the string can be escaped to represent unprintable characters. For example,

"Hello world\n"

represents a string with the character "\n", representing the new line character.

Figure 2.1 is an idealization of the code in the C compiler that interprets the character escape sequence. This is an amazing piece of code. It "knows" in a completely portable way what character code is compiled for a new line in any character set. The act of knowing then allows it to recompile itself, thus perpetuating the knowledge.

Suppose we wish to alter the C compiler to include the sequence "\v" to represent the vertical tab character. The extension to Figure 2.1 is obvious and is presented in Figure 2.2. We then recompile the C compiler, but we get a diagnostic. Obviously, since the binary version of the compiler does not know about "\v", the source is not legal C. We must "train" the compiler. After it "knows" what "\v" means, then our new change will become legal C. We look up on an ASCII chart that a vertical tab is decimal 11. We alter our source to look like Figure 2.3. Now the old compiler accepts the new source. We install the resulting binary as the new official C compiler and now we can write the portable version the way we had it in Figure 2.2.

This is a deep concept. It is as close to a "learning" program as I have seen. You simply tell it once, then you can use this self-referencing definition.

Figure 2.1

c = next();
if (c != '\\')
    return(c);
c = next();
if (c == '\\')
    return('\\');
if (c == 'n')
    return('\n');

Figure 2.2

c = next();
if (c != '\\')
    return(c);
c = next();
if (c == '\\')
    return('\\');
if (c == 'n')
    return('\n');
if (c == 'v')
    return('\v');

Figure 2.3

c = next();
if (c != '\\')
    return(c);
c = next();
if (c == '\\')
    return('\\');
if (c == 'n')
    return('\n');
if (c == 'v')
    return(11);
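
The figures are fragments, so here is one way Figure 2.3 might be wrapped into something that compiles. Everything around the if-chain is an assumption made for illustration: set_input and read_char are invented names, next() reads from an in-memory buffer, and returning the character unchanged after an unrecognized backslash is a choice of this sketch that the paper does not specify.

```c
/* Assumed input source: a cursor over a NUL-terminated buffer. */
static const char *input = "";

/* Point the reader at a new buffer. */
void set_input(const char *text)
{
    input = text;
}

/* Return the next source character, or 0 at the end of the input. */
static int next(void)
{
    return *input != '\0' ? (unsigned char)*input++ : 0;
}

/*
 * Read one logical character, following the shape of Figure 2.3.
 * An unrecognized escape yields the character after the backslash
 * (an assumption of this sketch).
 */
int read_char(void)
{
    int c = next();

    if (c != '\\')
        return c;
    c = next();
    if (c == '\\')
        return '\\';
    if (c == 'n')
        return '\n';
    if (c == 'v')
        return 11;    /* ASCII vertical tab, written numerically as in Figure 2.3 */
    return c;
}
```

Note that '\n' in this code only has a value because the compiler building it already knows what '\n' means, while 11 is spelled out by hand. That is the distinction between Figure 2.2 and Figure 2.3.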

Jonathan Leffler answered Oct 12 '22 09:10


Here is an excellent overview of what a compiler is. It lists the components: Difference between compilers and parsers?

The short answer is that the compiler is a string recognizer. It sees something that matches a rule (based on context), and then decides what the outcome should be.

Here is a related post, one of whose answers also recommends the article that Jonathan Leffler recommended: What's the Magic Behind Escape(\) Character

Another short answer to the whole compiler thing is grammar.

CppLearner answered Oct 12 '22 08:10