I have a program that's removes every variables inside a string. Theses variables start with '$'. So, for, example, if I give a string like [1,2,$1,$2], it should return just [1,2].
However, which loop is better for perfomance?
This:
while (token != NULL)
{
if (*token != '$')
{
sprintf(dst, "%s,%s", dst, token);
}
token = strtok(NULL, "], ");
}
or this:
while (token != NULL)
{
if (*token != '$')
{
strcat(dst, token);
strcat(dst, ",");
}
token = strtok(NULL, "], ");
}
According to C11 standard 7.21.6.6 p2, describing sprintf
:
If copying takes place between objects that overlap, the behavior is undefined.
So your first snippet is invoking an undefined behavior as it is copying from dst
to dst
. The strcat
approach does not have this issue.
strtok
is destructive, so the input string will not be usable after this code is executed. That being the case, you might as well do the transformation in place. This has a couple of advantages, one of which being that you do not need to allocate any memory (because the final string cannot be longer than the original string). This also requires a little bit of extra bookkeeping, but that provides another advantage: the resulting function's execution time is linear in the size of the input. Restarting the scan of the output buffer from the beginning on each iteration -- as both of your solutions do -- makes the function quadratic in the length of the string. [Note 1] The use of a quadratic algorithm is much more serious than minor differences in the cost of alternative standard library calls.
As various people have mentioned, it is Undefined Behaviour to call sprintf
in such a way that there is overlap between the output buffer and one of the strings to be printed. So your usage of sprintf
is incorrect even though it might appear to work on some implementation.
Neither strcat
nor sprintf
protects you from buffer overflow. You could use snprintf
(by placing the new string at the end of the accumulated buffer, instead of overwriting the beginning of the buffer with itself on each iteration) or you could use strncat
, but in both cases you'd need to do some extra bookkeeping.
So here's a quick implementation of the algorithm proposed in the first point. Note that it does not call malloc()
nor does it allocate any string on the stack. Also note that it uses memmove
rather than memcpy
to move the newly-discovered token forward in the string in order to avoid problems if the token and its destination were to overlap. (memmove
allows an overlap; memcpy
, strcpy
and strcat
do not.)
/* Compresses the string str in place by removing leading and trailing separator
* characters (which are the characters in 'fs') and replacing any interior
* sequence of separator characters with a single instance of 'ofs'.
* Returns 'str'.
*/
char* compress(char* str, const char* fs, char ofs) {
char* out = str;
char* token = strtok(str, fs);
while (token != NULL) {
size_t tlen = strlen(token);
memmove(out, token, tlen);
out += tlen;
*out++ = ofs;
token = strtok(NULL, fs);
}
/* Overwrite the last delimiter (if there was one) with a NUL */
if (out != str) --out;
*out = 0;
return str;
}
Unlike the original, this does not discard tokens starting with a $
. That would be trivial to add.
Also unlike the original, this function goes to the trouble of avoiding a trailing ,
. Again, that would be easy to change if there were a good reason. (However, the trailing comma means that strings with just one token will end up being one character longer, so the in-place guarantee can't be made.)
I chose to return the address of the start of the compressed string (which is the same as the address of the input buffer) to be consistent with various standard C interfaces. However, in many cases it would be more useful to return out
(which is the address of the trailing NUL), in order to allow further concatenation without having to compute the new length of the string. Alternatively, one could return the new length of the string, as sprintf
does (return out - str;
)
This API is honest about the fact that it destroys the original string (by overwriting it with the transformed one); functions which simply call strtok
on their input but return a separate output can cause subtle bugs because it is not obvious to the caller that the original string is destroyed. While it is not possible to restore a string after strtok
has been called on it, it is easy to convert an in-place algorithm into a non-destructive algorithm, by simply copying the original string:
/* Returns freshly allocated memory; caller is responsible for freeing it */
char* compress_and_copy(const char* str, const char* fs, char ofs) {
return compress(strdup(str), fs, ofs);
}
It's of course possible that the original unsimplified function did not have the feature that it was guaranteed to produce a shorter string; for example, it might be expanding segments starting with a $
by replacing them with the value of a variable. In this case, it would be necessary to produce a new string.
In some cases, maybe even most cases, the output will still be shorter than the input. But one should resist the temptation to do the transformation in place if possible and allocate a new string only if necessary. Although it might be more efficient, it complicates the rules for allocation ownership; you end up having to say "caller owns the returned string only if it is different from the original string", which is clumsy and accident-prone.
So if that's the actual use case, the optimal solution (from the perspective of clean API design) is to use strspn()
and strcspn()
to non-destructively walk the original string. That's a little more work, because it needs even more bookkeeping; on the other hand, it avoids the need to recompute strlen(token)
after the token has been identified.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With