Fast C comparison

Tags:

c

substring

comparison

As part of a protocol I'm receiving C strings of the following format:
WORD * WORD
where both WORDs are the same given string,
and * is any string of printable characters, NOT including spaces!

So the following are all legal:

WORD asjdfnkn WORD
WORD 234kjk2nd32jk WORD

And the following are illegal:

1. WORD akldmWORD
2. WORD asdm zz WORD
3. NOTWORD admkas WORD
4. NOTWORD admkas NOTWORD

Where (1) is missing the space before the trailing WORD; (2) has 3 or more spaces; (3)/(4) do not open/end with the correct string (WORD).

Of course this could be implemented in a pretty straightforward way, but I'm not sure that what I'm doing is the most efficient. Note: WORD is preset for a whole run, but could change from run to run.

Currently I strncmp each string against "WORD ". If that check passes, I manually (char by char) run over the string to look for the second space char.
If found, I then strcmp the rest (all the way to the end) with "WORD".
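
For reference, the approach described above could be sketched roughly as follows. This is an illustrative reconstruction, not the code actually in use: the function name `check_current` and its parameters are assumptions, and it does not verify that the middle characters are printable.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: `prefix` is "WORD " (with the trailing space),
 * `word` is "WORD". Both are assumed to be fixed for the whole run. */
static bool check_current(const char *s, const char *prefix, const char *word)
{
    size_t plen = strlen(prefix);

    /* Step 1: the string must start with "WORD ". */
    if (strncmp(s, prefix, plen) != 0)
        return false;

    /* Step 2: walk char by char until the second space (or the end). */
    const char *p = s + plen;
    while (*p != '\0' && *p != ' ')
        p++;
    if (*p != ' ')
        return false;

    /* Step 3: everything after that space must be exactly "WORD". */
    return strcmp(p + 1, word) == 0;
}
```

Since WORD is fixed for a run, `strlen(prefix)` would normally be computed once up front rather than on every call.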

Would love to hear your solution, with an emphasis on efficiency, as I'll be running over millions of these in real time.


asked Jul 10 '11 21:07

Trevor

3 Answers

I'd say have a look at the algorithms in the Handbook of Exact String-Matching Algorithms, compare their complexities, choose the one you like best, and implement it.

Or you can use some ready-made implementations.

You have some really classical algorithms for searching strings inside another string here:

KMP (Knuth-Morris-Pratt)

Rabin-Karp

Boyer-Moore

Hope this helps :)


answered Oct 26 '22 06:10

wsdookadr

Have you profiled?

There's not much gain to be had here, since you're doing basic string comparisons. If you want to go for the last few percent of performance, I'd change out the str... functions for mem... functions.

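char *bufp, *bufe; // pointer to buffer, one past end of buffer
// Shortest possible message: WORD, ' ', empty data, ' ', WORD.
if (bufe - bufp < wordlen * 2 + 2)
    error();
// The message must start with WORD followed by a space.
if (memcmp(bufp, word, wordlen) || bufp[wordlen] != ' ')
    error();
bufp += wordlen + 1;
char *datap = bufp;
// Find the space that ends the data, searching only the remaining bytes.
char *datae = memchr(bufp, ' ', bufe - bufp);
// Exactly " WORD" must follow the data, with nothing after it.
if (!datae || bufe - datae != wordlen + 1)
    error();
if (memcmp(datae + 1, word, wordlen))
    error();
// Your data is in the range [datap, datae).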

The performance gains are likely less than spectacular. You have to examine each character in the buffer since each character could be a space, and any character in the delimiters could be wrong. Changing a loop to memchr is slick, but modern compilers know how to do that for you. Changing a strncmp or strcmp to memcmp is also probably going to be negligible.
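
A self-contained variant of the same `memcmp`/`memchr` idea, wrapped in a function so it can be tested in isolation. This is only a sketch under assumptions: the name `parse_msg` and its parameters are hypothetical, the message length is assumed to be known in advance, at least one data byte is required, and printability of the data is not checked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if buf[0..len) is exactly "WORD <data> WORD".
 * On success, *datap and *datalen describe the data range. */
static bool parse_msg(const char *buf, size_t len,
                      const char *word, size_t wordlen,
                      const char **datap, size_t *datalen)
{
    /* WORD, ' ', at least one data byte, ' ', WORD. */
    if (len < 2 * wordlen + 3)
        return false;
    if (memcmp(buf, word, wordlen) != 0 || buf[wordlen] != ' ')
        return false;

    const char *data = buf + wordlen + 1;
    const char *end = buf + len;

    /* First space after the leading "WORD " ends the data. */
    const char *sp = memchr(data, ' ', (size_t)(end - data));
    if (sp == NULL || sp == data)
        return false;

    /* Exactly " WORD" must remain; this also rejects extra spaces. */
    if ((size_t)(end - sp) != wordlen + 1)
        return false;
    if (memcmp(sp + 1, word, wordlen) != 0)
        return false;

    *datap = data;
    *datalen = (size_t)(sp - data);
    return true;
}
```

Because WORD is fixed for a run, `wordlen` can be computed once and passed in, so no `strlen` call is needed per message.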

answered Oct 26 '22 06:10

Dietrich Epp

There is probably a tradeoff to be made between the shortest code and the fastest implementation. Choices are:

1. The regular expression ^WORD \S+ WORD$ (requires a regex engine)
2. strchr on "WORD " and a strrchr on " WORD" with a lot of messy checks (not really recommended)
3. Walking the whole string character by character, keeping track of the state you are in (scanning first word, scanning first space, scanning middle, scanning last space, scanning last word, expecting end of string).

Option 1 requires the least code but backtracks near the end, and Option 2 has no redeeming qualities. I think you can do option 3 elegantly. Use a state variable and it will look okay. Remember to manually enter the last two states based on the length of your word and the length of your overall string and this will avoid the backtracking that a regex will most likely have.
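
A minimal sketch of what option 3 might look like, assuming the total length of the string is known up front. The names `match_states` and `enum state` are made up for illustration, and the sketch does not check that the middle characters are printable. The position where the trailing " WORD" must begin is computed from the two lengths, so the last two states are entered by index rather than by guessing.

```c
#include <stdbool.h>
#include <stddef.h>

enum state { FIRST_WORD, FIRST_SPACE, MIDDLE, LAST_SPACE, LAST_WORD };

/* Returns true if s[0..len) is exactly "WORD <non-space chars> WORD". */
static bool match_states(const char *s, size_t len,
                         const char *word, size_t wordlen)
{
    /* WORD, ' ', at least one middle char, ' ', WORD. */
    if (wordlen == 0 || len < 2 * wordlen + 3)
        return false;

    size_t last_space = len - wordlen - 1; /* index of the final space */
    enum state st = FIRST_WORD;
    size_t k = 0;                          /* index into word */

    for (size_t i = 0; i < len; i++) {
        char c = s[i];
        switch (st) {
        case FIRST_WORD:
            if (c != word[k]) return false;
            if (++k == wordlen) st = FIRST_SPACE;
            break;
        case FIRST_SPACE:
            if (c != ' ') return false;
            st = MIDDLE;
            break;
        case MIDDLE:
            if (c == ' ') return false;    /* no spaces inside the data */
            if (i + 1 == last_space) st = LAST_SPACE;
            break;
        case LAST_SPACE:
            if (c != ' ') return false;
            st = LAST_WORD;
            k = 0;
            break;
        case LAST_WORD:
            if (c != word[k++]) return false;
            break;
        }
    }
    return st == LAST_WORD && k == wordlen;
}
```

Each character is examined exactly once and there is no backtracking; whether this beats the library routines in practice would need to be measured.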

answered Oct 26 '22 06:10

Ray Toal
