
Does ISO C allow aliasing of the argv[] pointers supplied to main()?

ISO C requires that hosted implementations call a function named main. If the program receives arguments, they are received as an array of char* pointers, the second argument in main's definition int main(int argc, char* argv[]).

ISO C also requires that the strings pointed to by the argv array be modifiable.

But can the elements of argv alias one another? In other words, can there exist i, j such that

  • 0 <= i && i < argc
  • 0 <= j && j < argc
  • i != j
  • 0 < strlen(argv[i])
  • strlen(argv[i]) <= strlen(argv[j])
  • argv[i] aliases argv[j]

at program start-up? If so, a write through argv[i][0] would also be seen through the aliasing string argv[j].

The relevant clauses of the ISO C Standard are below, but do not allow me to conclusively answer the titular question.

§ 5.1.2.2.1 Program startup

The function called at program startup is named main. The implementation declares no prototype for this function. It shall be defined with a return type of int and with no parameters:

int main(void) { /* ... */ }

or with two parameters (referred to here as argc and argv, though any names may be used, as they are local to the function in which they are declared):

int main(int argc, char *argv[]) { /* ... */ }

or equivalent; or in some other implementation-defined manner.

If they are declared, the parameters to the main function shall obey the following constraints:

  • The value of argc shall be nonnegative.
  • argv[argc] shall be a null pointer.
  • If the value of argc is greater than zero, the array members argv[0] through argv[argc-1] inclusive shall contain pointers to strings, which are given implementation-defined values by the host environment prior to program startup. The intent is to supply to the program information determined prior to program startup from elsewhere in the hosted environment. If the host environment is not capable of supplying strings with letters in both uppercase and lowercase, the implementation shall ensure that the strings are received in lowercase.
  • If the value of argc is greater than zero, the string pointed to by argv[0] represents the program name; argv[0][0] shall be the null character if the program name is not available from the host environment. If the value of argc is greater than one, the strings pointed to by argv[1] through argv[argc-1] represent the program parameters.
  • The parameters argc and argv and the strings pointed to by the argv array shall be modifiable by the program, and retain their last-stored values between program startup and program termination.

By my reading, the answer to the titular question is "yes", since nowhere is it explicitly forbidden and nowhere does the standard urge or require the use of char* restrict*-qualified argv, but the answer might turn on the interpretation of "and retain their last-stored values between program startup and program termination.".

The practical import of this question is that if the answer to it is indeed "yes", a portable program that wishes to modify the strings in argv must first perform (the equivalent of) POSIX strdup() on them for safety.
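A minimal sketch of that defensive copy, assuming only ISO C library functions (malloc and memcpy) in place of POSIX strdup(). The names copy_args and free_args are hypothetical and come from neither the standard nor POSIX:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated, null-terminated array of
 * newly allocated copies of argv[0..argc-1], or NULL on allocation failure. */
char **copy_args(int argc, char *argv[])
{
    char **copy = malloc(((size_t)argc + 1) * sizeof *copy);
    if (copy == NULL)
        return NULL;

    for (int i = 0; i < argc; i++) {
        size_t len = strlen(argv[i]) + 1;   /* include the terminator */
        copy[i] = malloc(len);
        if (copy[i] == NULL) {
            while (i > 0)                   /* undo the earlier allocations */
                free(copy[--i]);
            free(copy);
            return NULL;
        }
        memcpy(copy[i], argv[i], len);
    }
    copy[argc] = NULL;                      /* mirror argv[argc] == NULL */
    return copy;
}

/* Releases everything allocated by copy_args(). */
void free_args(char **copy)
{
    if (copy == NULL)
        return;
    for (char **p = copy; *p != NULL; p++)
        free(*p);
    free(copy);
}
```

A program would call copy_args(argc, argv) at the top of main and modify only the copies, so a write through one copy cannot be seen through another even if the original strings did alias.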

Iwillnotexist Idonotexist asked Jun 10 '18 00:06


3 Answers

By my reading, the answer to the titular question is "yes", since nowhere is it explicitly forbidden and nowhere does the standard urge or require the use of restrict-qualified argv, but the answer might turn on the interpretation of "and retain their last-stored values between program startup and program termination.".

I concur that the standard does not explicitly forbid elements of the argument vector from being aliases of each other. I don't think the modifiability and value-retention provisions contradict that position, but they do suggest to me that the committee did not consider the possibility of aliasing.

The practical import of this question is that if the answer to it is indeed "yes", a portable program that wishes to modify the strings in argv must first perform (the equivalent of) POSIX strdup() on them for safety.

Indeed, that's exactly why I think the committee didn't even consider the possibility. If they had, then surely they would have at least included a footnote to that effect, or else explicitly specified that the argument strings are all distinct.

I'm inclined to think that this detail escaped the committee's attention because, in practice, implementations do provide distinct strings, and because it is rare for programs to modify their argument strings (though modifying argv itself is somewhat more common). If the committee agreed to issue an official interpretation in this area, I would not be surprised if they came down against the possibility of aliasing.

Until and unless such an interpretation is issued, however, you are right that strict conformance does not permit you to rely a priori on argv elements not being aliased.

John Bollinger answered Oct 01 '22 23:10


The way it works on common *nix platforms (including Linux and Mac OS, presumably FreeBSD too) is that argv is an array of pointers into a single memory area containing the argument strings one after another (separated only by the null terminator). Using execl() does not change this: even if the caller passes the same pointer multiple times, the source string is copied multiple times, with no special behavior for identical (i.e. aliased) pointers (an uncommon case with no great benefit to optimize).
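One way to observe that layout from inside a program is to check whether each string starts on the byte right after the previous string's terminator. This is only a sketch: count_packed_pairs is a hypothetical name, and a matching result is consistent with a single packed area without proving one.

```c
#include <string.h>

/* Hypothetical helper: counts how many adjacent pairs argv[i], argv[i+1]
 * are packed back-to-back, i.e. argv[i+1] starts on the byte immediately
 * after argv[i]'s terminator. Only == is used, which is defined for any
 * two pointers, including a one-past-the-end pointer. */
int count_packed_pairs(int argc, char *argv[])
{
    int packed = 0;
    for (int i = 0; i + 1 < argc; i++) {
        if (argv[i] + strlen(argv[i]) + 1 == argv[i + 1])
            packed++;
    }
    return packed;
}
```

On a platform that lays the arguments out as described, I would expect count_packed_pairs(argc, argv) to return argc - 1 whenever argc is at least 1.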

However, C does not require this implementation. The truly paranoid may want to copy every string before modifying it, perhaps skipping the copies if memory is limited and a loop over argv shows that none of the pointers actually alias (at least among those the program intends to modify). This seems overly paranoid unless you are developing flight software or the like.
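The loop over argv could be sketched as follows. The helper names points_into and args_alias are hypothetical. The check uses only pointer equality, because relational comparison (<, >) of pointers into different objects is undefined in C, whereas == is not. It costs time proportional to the total string length for each pair of arguments.

```c
#include <string.h>

/* Returns 1 if p points at any byte of the string s (terminator included),
 * otherwise 0. Every s + k tested here lies inside the string itself. */
static int points_into(const char *p, const char *s)
{
    size_t len = strlen(s);
    for (size_t k = 0; k <= len; k++) {
        if (p == s + k)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: returns 1 if any two of argv[0..argc-1] share
 * storage (same start, or one starting inside the other), otherwise 0. */
int args_alias(int argc, char *argv[])
{
    for (int i = 0; i < argc; i++) {
        for (int j = i + 1; j < argc; j++) {
            if (points_into(argv[i], argv[j]) || points_into(argv[j], argv[i]))
                return 1;
        }
    }
    return 0;
}
```

If args_alias(argc, argv) returns 0, the program can modify the strings in place; otherwise it falls back to copying them first.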

John Zwinck answered Oct 01 '22 23:10


As a data point, I have compiled and run the following programs on several systems. (Disclaimer: these programs are intended to provide a data point, but as we'll see, they do not end up answering the question as stated.)

p1.c:

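#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char test[] = "test";
    /* Pass the same pointer twice; the sentinel must be a null char *. */
    execl("./p2", "p2", test, test, (char *)NULL);
    perror("execl");    /* reached only if execl() fails */
    return 1;
}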

p2.c:

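#include <stdio.h>

int main(int argc, char **argv)
{
    int i;
    if (argc < 2) return 1;   /* argv[1] would be a null pointer */
    for(i = 1; i < argc; i++) printf("%s ", argv[i]);
    printf("\n");
    argv[1][0] = 'b';         /* shows up in argv[2] only if the two alias */
    for(i = 1; i < argc; i++) printf("%s ", argv[i]);
    printf("\n");
    return 0;
}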

Every place I've tried it (under MacOS and several flavors of Unix and Linux) it has printed

test test 
best test 

Since the second line was never "best best", this proves that, on the tested systems, by the time the second program is run, the strings are no longer aliased.

Of course, this test does not prove that strings in argv can never be aliased, under any circumstances, under any system out there. I think all it proves is that, unsurprisingly, each of the tested operating systems recopies the argument list at least once between the time p1 calls execl and the time that p2 is actually invoked. In other words, the argument vector constructed by the invoking program is not used directly in the called program, and in the process of copying it, it is (again not surprisingly) "normalized", meaning that the effects of any aliasing are lost.

(I say this is not surprising because if you think about the way the exec family of system calls actually works, and the way process memory is laid out under Unix-like systems, there's no way that the invoking program's argument list could be used directly; it has to be copied, at least once, into the address space of the new, exec'ed process. Furthermore, any obvious and straightforward method of copying the argument list is always and automatically going to "normalize" it in this way; the kernel would have to do significant, extra, totally unnecessary work in order to detect and preserve any aliasing.)
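To illustrate the "normalizing" effect, here is a user-space sketch of a straightforward copy. It is not kernel code, and flatten_args is a hypothetical name; it only assumes that each source string is copied by value into one fresh area, as described above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an exec-style copy: copies each source string,
 * by value, into one freshly allocated area and fills out[] with pointers
 * into it. Returns the area (caller frees) or NULL on allocation failure.
 * out must have room for argc + 1 pointers. */
char *flatten_args(int argc, char *const src[], char *out[])
{
    size_t total = 0;
    for (int i = 0; i < argc; i++)
        total += strlen(src[i]) + 1;

    char *area = malloc(total ? total : 1);
    if (area == NULL)
        return NULL;

    char *dst = area;
    for (int i = 0; i < argc; i++) {
        size_t len = strlen(src[i]) + 1;
        memcpy(dst, src[i], len);   /* aliased sources still get own copies */
        out[i] = dst;
        dst += len;
    }
    out[argc] = NULL;
    return area;
}
```

Passing the same pointer twice in src yields two separate copies in out, so a write through one no longer shows up in the other, which matches what the experiment observed.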

Just in case it matters, I modified the first program in this way:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char test[] = "test";
    char *argv[] = {"p2", test, test, NULL};
    execv("./p2", argv);
    perror("execv");    /* reached only if execv() fails */
    return 1;
}

The results were unchanged.


With all of this said, I agree that this issue does seem like an oversight or buglet in the Standard. I'm not aware of any clause guaranteeing that the strings pointed to by argv are distinct, so a paranoidly written program probably can't depend on such a guarantee, however likely it is that (as this answer demonstrates) any reasonable implementation provides distinct strings.

Steve Summit answered Oct 01 '22 22:10