
Confused about C string constants

Tags: c, string

When I came across this C implementation of Porter's stemming algorithm, I found a C-ism I was confused about.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

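void test( char *s )
{
    /* The first byte of the string holds the length as a character code. */
    int len = s[0];

    printf("len = %i\n", len );

    /* With the length stored at index 0, s[len] is the last character of the text. */
    printf("s[len] = %c\n", s[len] );
}

int main(void)
{
    test("\07" "abcdefg");

    return 0;
}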

and output:

len = 7
s[len] = g

However, when I input

test("\08" "abcdefgh");

or any string constant longer than 7 characters with the corresponding length in the first pair of quotes (e.g. test("\09" "abcdefghi");), the output is

len = 0
s[len] = 

But any input like test("\01" "abcdefgh"); prints out the character in that position ( if we call the first character position 1 and not 0 for the moment )

It appears that test( char *s ) reads the number in the first pair of quotes (how it does this I am not sure, since I thought s[0] would only be able to read a single char, i.e. the '\') and prints the character at that position of the string constant in the second pair of quotes.

My question is this: it seems as if we are passing two string constants into test( char *s ). What exactly is happening here? How does the compiler seem to "split" the string over two pairs of quotes? A related question: is a string of the form "blah" "abcdefg" one consecutive block of memory? I may have overlooked something elementary, but even so I would like to know what I overlooked. I know this is a basic concept, but I could not find a clear example on the web that explains it, and in all honesty I don't follow the output. Any helpful comments are welcome.

Rob L asked Jul 02 '14 20:07


4 Answers

There are at least three things going on here:

  • Literal strings juxtaposed against one another are concatenated by the compiler. "a" "b" is exactly the same as "ab".

  • The backslash is an escape character, which means it is not copied literally into the resulting string. The notation \01 means "the character with ASCII value 1".

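  • A backslash followed by one to three octal digits is an octal escape sequence. Octal numbers are base 8, written with the digits 0 through 7. 8 is not a valid octal digit, so "\08" is not the successor of "\07": it is \0 (the character with value 0) followed by the ordinary character 8. The value 8 is written \10 in octal.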

Greg Hewgill answered Sep 25 '22 02:09


The problem is not in the length of the string, but in the \o syntax for specifying non-printable values in string literals. \o, \oo, and \ooo (where each o is an octal digit) denote octal escapes, i.e. a single character whose value is written in base 8. Since 08 is not a valid base 8 number, \08 is interpreted as \0 followed by the ordinary character '8'.

To fix the problem, represent 8 as \10 or \010:

test("\007" "abcdefg");
test("\010" "abcdefgh");

...or switch to hexadecimal, where the \x prefix makes the base more explicit to the casual reader:

test("\x07" "abcdefg");
test("\x08" "abcdefgh");
test("\x09" "abcdefghi");
test("\x0a" "abcdefghij");
...

user4815162342 answered Sep 24 '22 02:09


\number in a character or string literal means the character whose code is the value number. number is interpreted in octal, so the first non-octal digit terminates the number. So "\07" is a one-character string containing the character with code 7, but "\08" is a two-character string containing the character with code 0 followed by the digit 8.

Additionally, code 0 is the null terminator that C uses to indicate the end of a string. So that second string ends at the beginning, because its first byte is the terminator. This is why the length of the string in your second example is 0.

Barmar answered Sep 24 '22 02:09


When two or more string literals are adjacent (separated only by white-space), the compiler joins them into a single string. Therefore "\07" "abcdefg" is equivalent to "\07abcdefg". "\07" is an octal escape. An octal escape ends after three digits or at the first non-octal character. So, when you write "\08", 8 is not an octal digit, so the escape ends there and the value 0 is stored at s[0].
Now len is 0, and printing s[len] tries to print the character at s[0], which is the NUL character (code 0). It is a control character with no visible glyph; in ASCII only codes 32 (space) through 126 are printable.

haccks answered Sep 24 '22 02:09