Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast C comparison

As part of a protocol I'm receiving C string of the following format:
WORD * WORD
Where both WORDs are the same given string.
And, * - is any string of printable characters, NOT including spaces!

So the following are all legal:

  • WORD asjdfnkn WORD
  • WORD 234kjk2nd32jk WORD

And the following are illegal:

  1. WORD akldmWORD
  2. WORD asdm zz WORD
  3. NOTWORD admkas WORD
  4. NOTWORD admkas NOTWORD

Where (1) is missing a trailing space; (2) has 3 or more spaces; (3)/(4) do not open/end with the correct string (WORD).

Of-course this could be implemented pretty straight-forward, however I'm not sure what I'm doing is the most efficient. Note: WORD is pre-set for a whole run, however could change from run to run.

Currently I'm strncmping each string against "WORD ". If that checks manually (char-by-char) run over the string, to check for the second space char.
[If found] I then strcmp (all the way) with "WORD".

Would love to hear your solution, with an emphasis on efficiency as I'll be running over millions of theses in real-time.

like image 645
Trevor Avatar asked Jul 10 '11 21:07

Trevor


People also ask

Can we compare two strings using == in C?

You can't compare strings in C with ==, because the C compiler does not really have a clue about strings beyond a string-literal.

How can we compare two strings in C?

We compare the strings by using the strcmp() function, i.e., strcmp(str1,str2). This function will compare both the strings str1 and str2. If the function returns 0 value means that both the strings are same, otherwise the strings are not equal.

Is strcmp efficient?

Typically, strcmp is well optimized and can do 32 or 64 bit comparisons for strings longer than 4/8 bytes depending on architecture.

How can I compare two dates in C?

Consider the problem of comparison of two valid dates d1 and d2. There are three possible outcomes of this comparison: d1 == d2 (dates are equal), d1 > d2 (date d1 is greater, i.e., occurs after d2) and d1 < d2(date d1 is smaller, i.e., occurs before d2).


3 Answers

I'd say, have a look at the algorithms in Handbook of Exact String-Matching Algorithms, compare the complexities and choose the one that you like best, implement it.

Or you can use some ready-made implementations.

You have some really classical algorithms for searching strings inside another string here:

KMP(Knuth-Morris-Pratt)

Rabin-Karp

Boyer-Moore

Hope this helps :)

like image 185
wsdookadr Avatar answered Oct 26 '22 06:10

wsdookadr


Have you profiled?

There's not much gain to be had here, since you're doing basic string comparisons. If you want to go for the last few percent of performance, I'd change out the str... functions for mem... functions.

char *bufp, *bufe; // pointer to buffer, one past end of buffer
if (bufe - bufp < wordlen * 2 + 2)
    error();
if (memcmp(bufp, word, wordlen) || bufp[wordlen] != ' ')
    error();
bufp += wordlen + 1;
char *datap = bufp;
char *datae = memchr(bufp, ' ', bufe - buf);
if (!datae || bufe - datae < wordlen + 1)
    error();
if (memcmp(datae + 1, word, wordlen))
    error();
// Your data is in the range [datap, datae).

The performance gains are likely less than spectacular. You have to examine each character in the buffer since each character could be a space, and any character in the delimiters could be wrong. Changing a loop to memchr is slick, but modern compilers know how to do that for you. Changing a strncmp or strcmp to memcmp is also probably going to be negligible.

like image 42
Dietrich Epp Avatar answered Oct 26 '22 06:10

Dietrich Epp


There is probably a tradeoff to be made between the shortest code and the fastest implementation. Choices are:

  1. The regular expression ^WORD \S+ WORD$ (requires a regex engine)

  2. strchr on "WORD " and a strrchr on " WORD" with a lot of messy checks (not really recommended)

  3. Walking the whole string character by character, keeping track of the state you are in (scanning first word, scanning first space, scanning middle, scanning last space, scanning last word, expecting end of string).

Option 1 requires the least code but backtracks near the end, and Option 2 has no redeeming qualities. I think you can do option 3 elegantly. Use a state variable and it will look okay. Remember to manually enter the last two states based on the length of your word and the length of your overall string and this will avoid the backtracking that a regex will most likely have.

like image 45
Ray Toal Avatar answered Oct 26 '22 06:10

Ray Toal