I hear this from a lot of programmers that the use of strtok maybe deprecated in near future. Some say it is still. Why is it a bad choice? strtok() works great in tokenizing a given string. Does it have to do anything with the time and space complexities? Best link I found on the internet was this. But that doesn't seem to solve my curiousity. Suggest any alternatives if possible.
strtok is neither thread safe nor re-entrant because it uses a static buffer while parsing. This means that if a function calls strtok , no function that it calls while it is using strtok can also use strtok , and it cannot be called by any function that is itself using strtok .
Personally, my biggest problem with strtok was that it was a destructive function.
The strtok() function reads string1 as a series of zero or more tokens, and string2 as the set of characters serving as delimiters of the tokens in string1. The tokens in string1 can be separated by one or more of the delimiters from string2.
If such a character is not found, there are no tokens in the string. strtok() returns a NULL pointer. The token ends with the first character contained in the string pointed to by string2. If such a character is not found, the token ends at the terminating NULL character.
Why is it a bad choice?
The fundamental technique for solving problems by programming is to construct abstractions which can be used reliably to solve sub-problems, and then compose solutions to those sub-problems into solutions to larger problems.
strtok's behaviour works directly against these goals in a variety of ways; it is a poor abstraction that is unreliable because it composes poorly.
The fundamental problem of tokenization is: given a position in a string, give the position of the end of the token beginning at that position. If strtok did only that, it would be great. It would have a clear abstraction, it would not rely on hidden global state, it would not modify its inputs.
To see the limitations of strtok, imagine trying to tokenize a language where we wish to separate tokens by spaces, unless the token is enclosed in "
"
, in which case we wish to apply a different tokenization rule to the contents of the quoted area, and then pick up with the space separation rule after. strtok composes very poorly with itself, and is therefore only useful for the most trivial of tokenization tasks.
Does it have to do anything with the time and space complexities?
No.
Suggest any alternatives if possible.
Lexers are not hard to write; just write one!
Bonus points if you write an immutable lexer. An immutable lexer is a little struct that contains a reference to the string being lexed, the current position of the lexer, and any state needed by the lexer. To extract a token you call a "next token" method, pass in the lexer, and you get back the token and a new lexer. The new lexer can then be used to lex the next token, and you discard the previous lexer if you wish.
The immutable lexer technique is easier to reason about than lexers which modify state. And you can debug them by saving the discarded lexers in a list, and now you have the complete history of tokenization operations open to inspection at once.
The limitation of strtok(char *str, const char *delim)
is that it can't work on multiple strings simultaneously as it maintains a static pointer to store the index till it has parsed (hence sufficient if playing with only one string at a time). The better and safer method is to use strtok_r(char *str, const char *delim, char **saveptr)
which explicitly takes a third pointer to save the parsed index.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With