As part of a protocol I'm receiving C string of the following format:
WORD * WORD
Where both WORDs are the same given string.
And, * - is any string of printable characters, NOT including spaces!
So the following are all legal:
And the following are illegal:
Where (1) is missing a trailing space; (2) has 3 or more spaces; (3)/(4) do not open/end with the correct string (WORD).
Of-course this could be implemented pretty straight-forward, however I'm not sure what I'm doing is the most efficient. Note: WORD is pre-set for a whole run, however could change from run to run.
Currently I'm strncmping each string against "WORD ".
If that checks manually (char-by-char) run over the string, to check for the second space char.
[If found] I then strcmp (all the way) with "WORD".
Would love to hear your solution, with an emphasis on efficiency as I'll be running over millions of theses in real-time.
You can't compare strings in C with ==, because the C compiler does not really have a clue about strings beyond a string-literal.
We compare the strings by using the strcmp() function, i.e., strcmp(str1,str2). This function will compare both the strings str1 and str2. If the function returns 0 value means that both the strings are same, otherwise the strings are not equal.
Typically, strcmp is well optimized and can do 32 or 64 bit comparisons for strings longer than 4/8 bytes depending on architecture.
Consider the problem of comparison of two valid dates d1 and d2. There are three possible outcomes of this comparison: d1 == d2 (dates are equal), d1 > d2 (date d1 is greater, i.e., occurs after d2) and d1 < d2(date d1 is smaller, i.e., occurs before d2).
I'd say, have a look at the algorithms in Handbook of Exact String-Matching Algorithms, compare the complexities and choose the one that you like best, implement it.
Or you can use some ready-made implementations.
You have some really classical algorithms for searching strings inside another string here:
KMP(Knuth-Morris-Pratt)
Rabin-Karp
Boyer-Moore
Hope this helps :)
Have you profiled?
There's not much gain to be had here, since you're doing basic string comparisons. If you want to go for the last few percent of performance, I'd change out the str...
functions for mem...
functions.
char *bufp, *bufe; // pointer to buffer, one past end of buffer
if (bufe - bufp < wordlen * 2 + 2)
error();
if (memcmp(bufp, word, wordlen) || bufp[wordlen] != ' ')
error();
bufp += wordlen + 1;
char *datap = bufp;
char *datae = memchr(bufp, ' ', bufe - buf);
if (!datae || bufe - datae < wordlen + 1)
error();
if (memcmp(datae + 1, word, wordlen))
error();
// Your data is in the range [datap, datae).
The performance gains are likely less than spectacular. You have to examine each character in the buffer since each character could be a space, and any character in the delimiters could be wrong. Changing a loop to memchr
is slick, but modern compilers know how to do that for you. Changing a strncmp
or strcmp
to memcmp
is also probably going to be negligible.
There is probably a tradeoff to be made between the shortest code and the fastest implementation. Choices are:
The regular expression ^WORD \S+ WORD$
(requires a regex engine)
strchr
on "WORD "
and a strrchr
on " WORD" with a lot of messy checks (not really recommended)
Walking the whole string character by character, keeping track of the state you are in (scanning first word, scanning first space, scanning middle, scanning last space, scanning last word, expecting end of string).
Option 1 requires the least code but backtracks near the end, and Option 2 has no redeeming qualities. I think you can do option 3 elegantly. Use a state variable and it will look okay. Remember to manually enter the last two states based on the length of your word and the length of your overall string and this will avoid the backtracking that a regex will most likely have.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With