
strtok() issue: If tokens are delimited by delimiters, why is the last token the text between a delimiter and the null character '\0'?

In the following program, strtok() works as expected for the most part, but I can't understand the reason behind one result. I have read this about strtok():

To determine the beginning and the end of a token, the function first scans from the starting location for the first character not contained in delimiters (which becomes the beginning of the token). And then scans starting from this beginning of the token for the first character contained in delimiters, which becomes the end of the token.

Source: http://www.cplusplus.com/reference/cstring/strtok/

As we know, strtok() places a \0 at the end of each token. But in the following program, the last delimiter is a dot (.), and Toad sits between that dot and the closing quotation mark ("). The dot is a delimiter in my program, but there is no delimiter after Toad, not even white space (which is also a delimiter in my program). Please clear up the following confusion arising from this premise:

Why does strtok() consider Toad a token even though it is not between two delimiters? This is what I read about strtok() when it encounters a null character (\0):

Once the terminating null character of str has been found in a call to strtok, all subsequent calls to this function with a null pointer as the first argument return a null pointer.

Source: http://www.cplusplus.com/reference/cstring/strtok/

Nowhere does it say that once a null character is encountered, a pointer to the beginning of the token is returned. We don't even have a token here, since we never found the end of one: no delimiter character was found after the scan began from the beginning of the token (i.e. from the 'T' of Toad); we only found a null character, not a delimiter. So why is the part between the last delimiter and the closing quotation mark of the argument string considered a token by strtok()? Please explain this.

Code:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char str[] = " Falcon,eagle-hawk..;buzzard,gull..pigeon sparrow,hen;owl.Toad";
    char *pch = strtok(str, " ;,.-");

    while (pch != NULL)
    {
        printf("%s\n", pch);
        pch = strtok(NULL, " ;,.-");
    }

    return 0;
}

Output:

Falcon
eagle
hawk
buzzard
gull
pigeon
sparrow
hen
owl
Toad

Rüppell's Vulture asked May 15 '13 17:05



2 Answers

The standard's specification of strtok (7.24.5.8) is pretty clear. In particular, paragraph 4 (emphasis added by me) is directly relevant to the question, if I understand it correctly:

3 The first call in the sequence searches the string pointed to by s1 for the first character that is not contained in the current separator string pointed to by s2. If no such character is found, then there are no tokens in the string pointed to by s1 and the strtok function returns a null pointer. If such a character is found, it is the start of the first token.

4 The strtok function then searches from there for a character that is contained in the current separator string. If no such character is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token will return a null pointer. If such a character is found, it is overwritten by a null character, which terminates the current token. The strtok function saves a pointer to the following character, from which the next search for a token will start.

In a call

char *where = strtok(string_or_NULL, delimiters);

the returned pointer, if it is not null, points to a token that extends from the first non-delimiter character found at or after the starting position (inclusive) to the next delimiter character (exclusive) if one exists, or to the end of the string if no later delimiter character exists.

Unlike the standard, the linked description doesn't explicitly mention the case of a token extending to the end of the string, so it is incomplete in that respect.
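The two quoted paragraphs map quite directly onto strspn() and strcspn(). The following is only a sketch of that reading, not how any particular C library implements strtok(); the name sketch_strtok is made up for illustration, and the handling of a null pointer on the very first call is my own choice (real strtok() leaves that case undefined).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical re-implementation for illustration only. */
char *sketch_strtok(char *s1, const char *s2)
{
    static char *saved;             /* where the next search starts */

    assert(s2 != NULL);
    if (s1 == NULL)
        s1 = saved;                 /* continue the previous sequence */
    if (s1 == NULL)
        return NULL;                /* nothing left to scan */

    /* Paragraph 3: skip separators to find the start of a token. */
    s1 += strspn(s1, s2);
    if (*s1 == '\0') {
        saved = NULL;
        return NULL;                /* no tokens left */
    }

    /* Paragraph 4: search for a separator that ends the token. */
    char *end = s1 + strcspn(s1, s2);
    if (*end == '\0') {
        /* No separator found: the token extends to the end of the
           string, and later searches return a null pointer. */
        saved = NULL;
    } else {
        *end = '\0';                /* overwrite the separator */
        saved = end + 1;            /* resume after it next time */
    }
    return s1;
}
```

Called on "owl.Toad" with the question's delimiter set, this sketch returns owl, then Toad through the "no separator found" branch, then a null pointer.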

Daniel Fischer answered Oct 01 '22 05:10


The POSIX description of strtok() says:

char *strtok(char *restrict s1, const char *restrict s2);

A sequence of calls to strtok() breaks the string pointed to by s1 into a sequence of tokens, each of which is delimited by a byte from the string pointed to by s2. The first call in the sequence has s1 as its first argument, and is followed by calls with a null pointer as their first argument. The separator string pointed to by s2 may be different from call to call.

The first call in the sequence searches the string pointed to by s1 for the first byte that is not contained in the current separator string pointed to by s2. If no such byte is found, then there are no tokens in the string pointed to by s1 and strtok() shall return a null pointer. If such a byte is found, it is the start of the first token.

The strtok() function then searches from there for a byte that is contained in the current separator string. If no such byte is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token shall return a null pointer. If such a byte is found, it is overwritten by a NUL character, which terminates the current token. The strtok() function saves a pointer to the following byte, from which the next search for a token shall start.

Note the second sentence of the third paragraph:

If no such byte is found, the current token extends to the end of the string pointed to by s1, and subsequent searches for a token shall return a null pointer.

This clearly states that in the example in the question, Toad is indeed a token. One way to think of it is that the list of delimiters always includes the NUL '\0' at the end of the delimiter string.
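One way to check the "implicit NUL delimiter" view is to see where the last token ends. The helper below is a hypothetical name of my own, not a library function; it is a small sketch that records the end of the string before tokenizing and then tests whether the last token stops exactly at the string's own terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the last token produced by strtok()
   is terminated by the string's original '\0' rather than by a
   delimiter that strtok() overwrote, and 0 otherwise. Modifies s. */
int last_token_ends_at_terminator(char *s, const char *delims)
{
    assert(s != NULL && delims != NULL);

    char *orig_end = s + strlen(s); /* record before strtok() edits s */
    char *last = NULL;

    for (char *tok = strtok(s, delims); tok != NULL;
         tok = strtok(NULL, delims))
        last = tok;

    return last != NULL && last + strlen(last) == orig_end;
}
```

For a string ending in "owl.Toad" the helper returns 1, because Toad is ended by the original terminator. For "owl.Toad." it returns 0, because the trailing dot was overwritten to end Toad.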


Having diagnosed that, note that strtok() is not a good function to use: it is neither thread-safe nor reentrant. On Windows, you can use strtok_s() instead; on Unix, you can usually use strtok_r(). These are better functions because they don't store the resume pointer internally; the caller supplies the storage for it.

Because strtok() is not reentrant, you cannot call a function that uses strtok() from inside a function that itself uses strtok() while it is using strtok(). Also, any library function that uses strtok() must be clearly identified as doing so because it cannot be called from a function that is using strtok(). So, using strtok() makes life hard.
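To illustrate the nesting problem, here is a minimal sketch that assumes a POSIX system where strtok_r() is available. The function count_fields and its record/field format are invented for the example. Each loop keeps its own state pointer, so the inner tokenization does not disturb the outer one.

```c
#define _POSIX_C_SOURCE 200809L     /* request the strtok_r() declaration */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical example: counts the comma-separated fields in all
   semicolon-separated records of text. Modifies text. */
int count_fields(char *text)
{
    char *outer_state;              /* resume position for records */
    char *inner_state;              /* resume position for fields */
    int fields = 0;

    assert(text != NULL);

    for (char *rec = strtok_r(text, ";", &outer_state); rec != NULL;
         rec = strtok_r(NULL, ";", &outer_state)) {
        for (char *fld = strtok_r(rec, ",", &inner_state); fld != NULL;
             fld = strtok_r(NULL, ",", &inner_state))
            fields++;
    }
    return fields;
}
```

With plain strtok() in both loops, the inner loop would overwrite the single saved position, and the outer loop would most likely stop after the first record.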

The other problem with the strtok() family of functions (and with strsep(), which is related) is that they overwrite the delimiter; you can't find out what the delimiter was after the tokenizer has tokenized the string. This can matter in some applications, such as parsing shell command lines, where it matters whether the delimiter is a pipe or a semicolon or an ampersand (or ...). So shell parsers usually don't use strtok(), despite the number of questions on SO about shells where the parser does use strtok().
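If the delimiter matters, one option is to scan without modifying the string and report which character ended each token. The struct span and the function next_span below are hypothetical names for a sketch built on strspn() and strcspn(); like strtok(), it skips runs of leading delimiters, so it is not a full shell tokenizer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token descriptor; the token is not null-terminated. */
struct span {
    const char *start;  /* first character of the token */
    size_t len;         /* number of characters in the token */
    char delim;         /* delimiter that ended it, '\0' at end of string */
};

/* Finds the next token at or after *pos without modifying the string.
   Returns 1 and fills *out if a token was found, 0 otherwise.
   Advances *pos past the token and its delimiter. */
int next_span(const char **pos, const char *delims, struct span *out)
{
    assert(pos != NULL && *pos != NULL && delims != NULL && out != NULL);

    const char *p = *pos + strspn(*pos, delims); /* skip delimiters */
    if (*p == '\0') {
        *pos = p;
        return 0;                   /* no more tokens */
    }

    out->start = p;
    out->len = strcspn(p, delims);  /* stops at a delimiter or at '\0' */
    out->delim = p[out->len];
    *pos = p + out->len + (out->delim != '\0');
    return 1;
}
```

On "ls|wc;date" with the delimiter set "|;&", successive calls report ls ended by '|', wc ended by ';', and date ended by '\0'.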

Generally, you should steer clear of plain strtok(), and it is up to you to decide whether strtok_r() or strtok_s() is appropriate for your purposes.

Jonathan Leffler answered Oct 01 '22 03:10