Fastest way to do a case-insensitive substring search in C/C++?

Note

The question below was asked in 2008 about some code from 2003. As the OP's update shows, this entire post has been obsoleted by vintage 2008 algorithms and persists here only as a historical curiosity.

I need to do a fast case-insensitive substring search in C/C++. My requirements are as follows:

Should behave like strstr() (i.e. return a pointer to the match point).
Must be case-insensitive (doh).
Must support the current locale.
Must be available on Windows (MSVC++ 8.0) or easily portable to Windows (i.e. from an open source library).

Here is the current implementation I am using (taken from the GNU C Library):

/* Return the offset of one string within another.
   Copyright (C) 1994,1996,1997,1998,1999,2000 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

/*
 * My personal strstr() implementation that beats most other algorithms.
 * Until someone tells me otherwise, I assume that this is the
 * fastest implementation of strstr() in C.
 * I deliberately chose not to comment it.  You should have at least
 * as much fun trying to understand it, as I had to write it :-).
 *
 * Stephen R. van den Berg, berg@pool.informatik.rwth-aachen.de */

/*
 * Modified to use table lookup instead of tolower(), since tolower() isn't
 * worth s*** on Windows.
 *
 * -- Anders Sandvig (anders@wincue.org)
 */

#if HAVE_CONFIG_H
# include <config.h>
#endif

#include <ctype.h>
#include <string.h>

typedef unsigned chartype;

char char_table[256];

void init_stristr(void)
{
  int i;
  char string[2];

  string[1] = '\0';
  for (i = 0; i < 256; i++)
  {
    string[0] = i;
    _strlwr(string);
    char_table[i] = string[0];
  }
}

#define my_tolower(a) ((chartype) char_table[a])

char *
my_stristr (phaystack, pneedle)
     const char *phaystack;
     const char *pneedle;
{
  register const unsigned char *haystack, *needle;
  register chartype b, c;

  haystack = (const unsigned char *) phaystack;
  needle = (const unsigned char *) pneedle;

  b = my_tolower (*needle); 
  if (b != '\0')
  {
    haystack--;             /* possible ANSI violation */
    do
      {
        c = *++haystack;
        if (c == '\0')
          goto ret0;
      }
    while (my_tolower (c) != (int) b);

    c = my_tolower (*++needle);
    if (c == '\0')
        goto foundneedle;

    ++needle;
    goto jin;

    for (;;)
    {
      register chartype a;
      register const unsigned char *rhaystack, *rneedle;

      do
      {
        a = *++haystack;
        if (a == '\0')
          goto ret0;
        if (my_tolower (a) == (int) b)
          break;
        a = *++haystack;
        if (a == '\0')
          goto ret0;
shloop:
        ;
      }
      while (my_tolower (a) != (int) b);

jin:
      a = *++haystack;
      if (a == '\0')
        goto ret0;

      if (my_tolower (a) != (int) c)
        goto shloop;

      rhaystack = haystack-- + 1;
      rneedle = needle;

      a = my_tolower (*rneedle);

      if (my_tolower (*rhaystack) == (int) a)
        do
        {
          if (a == '\0')
            goto foundneedle;

          ++rhaystack;
          a = my_tolower (*++needle);
          if (my_tolower (*rhaystack) != (int) a)
            break;

          if (a == '\0')
            goto foundneedle;

          ++rhaystack;
          a = my_tolower (*++needle);
        }
        while (my_tolower (*rhaystack) == (int) a);

      needle = rneedle;       /* took the register-poor approach */

      if (a == '\0')
        break;
    }
  }
foundneedle:
  return (char*) haystack;
ret0:
  return 0;
}
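
For readers without `_strlwr` (it is specific to the Microsoft C runtime), the table-lookup idea used above can be sketched portably with the standard `tolower()`. This is a minimal illustration and not the glibc algorithm: the names `fold_table`, `fold_table_init` and `table_stristr` are hypothetical, the search is a plain quadratic scan, and it assumes a single-byte character set with the table built after any `setlocale()` call.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical names for illustration; none of these appear in the original code. */
static unsigned char fold_table[256];

/* Build the lowercase table once, after setlocale() if the locale matters. */
void fold_table_init(void)
{
    int i;
    for (i = 0; i < 256; i++)
        fold_table[i] = (unsigned char) tolower(i);
}

/* Naive O(n*m) search; returns a pointer to the first match or NULL. */
char *table_stristr(const char *haystack, const char *needle)
{
    const unsigned char *h = (const unsigned char *) haystack;
    const unsigned char *n = (const unsigned char *) needle;

    if (*n == '\0')
        return (char *) haystack;   /* empty needle matches at the start, like strstr() */

    for (; *h != '\0'; h++) {
        const unsigned char *hp = h;
        const unsigned char *np = n;

        /* Advance while both strings continue and the folded bytes agree. */
        while (*np != '\0' && *hp != '\0' && fold_table[*hp] == fold_table[*np]) {
            hp++;
            np++;
        }
        if (*np == '\0')
            return (char *) h;      /* consumed the whole needle: match */
    }
    return NULL;
}
```

Indexing the table with `unsigned char` values avoids the negative-index problem that plain `char` would cause for bytes above 127.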

Can you make this code faster, or do you know of a better implementation?

Note: I noticed that the GNU C Library now has a new implementation of strstr(), but I am not sure how easily it can be modified to be case-insensitive, or if it is in fact faster than the old one (in my case). I also noticed that the old implementation is still used for wide character strings, so if anyone knows why, please share.

Update

Just to make things clear—in case it wasn't already—I didn't write this function, it's a part of the GNU C Library. I only modified it to be case-insensitive.

Also, thanks for the tip about strcasestr() and checking out other implementations from other sources (like OpenBSD, FreeBSD, etc.). It seems to be the way to go. The code above is from 2003, which is why I posted it here in hope for a better version being available, which apparently it is. :)

asked Oct 17 '08 09:10

Anders Sandvig

4 Answers

Use the Boost String Algorithms library. It is cross-platform and header-only (no library to link in). Not to mention that you should be using Boost anyway.

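#include <cstddef>
#include <boost/algorithm/string/find.hpp>

const char* istrstr( const char* haystack, const char* needle )
{
   using namespace boost;
   // The haystack is const, so the resulting range holds const char* iterators.
   iterator_range<const char*> result = ifind_first( haystack, needle );
   if( result ) return result.begin();   // non-empty range: match found

   return NULL;
}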

answered Oct 24 '22 02:10

deft_code

In my benchmark, the code you posted takes about 1.65 times as long as strcasestr (6.31 s versus 3.82 s of user time per run).

$ gcc -Wall -o my_stristr my_stristr.c
steve@solaris:~/code/tmp
$ gcc -Wall -o strcasestr strcasestr.c 
steve@solaris:~/code/tmp
$ ./bench ./my_stristr > my_stristr.result ; ./bench ./strcasestr > strcasestr.result;
steve@solaris:~/code/tmp
$ cat my_stristr.result 
run 1... time = 6.32
run 2... time = 6.31
run 3... time = 6.31
run 4... time = 6.31
run 5... time = 6.32
run 6... time = 6.31
run 7... time = 6.31
run 8... time = 6.31
run 9... time = 6.31
run 10... time = 6.31
average user time over 10 runs = 6.3120
steve@solaris:~/code/tmp
$ cat strcasestr.result 
run 1... time = 3.82
run 2... time = 3.82
run 3... time = 3.82
run 4... time = 3.82
run 5... time = 3.82
run 6... time = 3.82
run 7... time = 3.82
run 8... time = 3.82
run 9... time = 3.82
run 10... time = 3.82
average user time over 10 runs = 3.8200
steve@solaris:~/code/tmp

The main function was:

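#define _GNU_SOURCE  /* may be needed for strcasestr() on glibc */
#include <string.h>

/* init_stristr() and my_stristr() are the functions from the question. */
int main(void)
{
        char * needle="hello";
        char haystack[1024];
        int i;

        /* Fill the haystack with filler text, then place the needle at the very end. */
        for(i=0;i<sizeof(haystack)-strlen(needle)-1;++i)
        {
                haystack[i]='A'+i%57;
        }
        memcpy(haystack+i,needle, strlen(needle)+1);
        /*printf("%s\n%d\n", haystack, haystack[strlen(haystack)]);*/
        init_stristr();

        /* One million searches per run; swap the commented line to test the other function. */
        for (i=0;i<1000000;++i)
        {
                /*my_stristr(haystack, needle);*/
                strcasestr(haystack,needle);
        }

        return 0;
}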

It was suitably modified to test both implementations: the commented-out call was swapped in for each build. I notice as I am typing this up that I left in the init_stristr call, but it runs only once per run, so it shouldn't change the timings much. bench is just a simple shell script:

#!/bin/bash
function bc_calc()
{
        echo $(echo "scale=4;$1" | bc)
}
time="/usr/bin/time -p"
prog="$1"
accum=0
runs=10
for a in $(jot $runs 1 $runs)
do
        echo -n "run $a... "
        t=$($time $prog 2>&1| grep user | awk '{print $2}')
        echo "time = $t"
        accum=$(bc_calc "$accum+$t")
done

echo -n "average user time over $runs runs = "
echo $(bc_calc "$accum/$runs")
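
The same kind of measurement can also be taken inside the program, which removes process start-up from the timing. The following is a rough sketch under stated assumptions rather than the harness used for the numbers above: `search_fn` and `time_search` are hypothetical names, and `clock()` reports CPU time at whatever resolution the platform provides.

```c
#include <stddef.h>
#include <string.h>   /* only for strstr() in the usage example below */
#include <time.h>

/* Hypothetical type: any function with a strstr-like signature. */
typedef char *(*search_fn)(const char *, const char *);

/* Returns the CPU seconds spent on `iterations` calls of fn(haystack, needle).
   Storing each result in a volatile pointer discourages the compiler from
   dropping the calls as unused. */
double time_search(search_fn fn, const char *haystack, const char *needle,
                   long iterations)
{
    const char *volatile sink = NULL;
    clock_t start = clock();
    long i;

    for (i = 0; i < iterations; i++)
        sink = fn(haystack, needle);

    (void) sink;
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}
```

A call such as `time_search(strstr, haystack, needle, 1000000L)` gives a baseline to compare against the case-insensitive functions, provided they are passed through the same helper.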

answered Oct 24 '22 01:10

freespace

For platform independent use:

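#include <wchar.h>   /* wchar_t, NULL */
#include <wctype.h>  /* towlower, wint_t */

const wchar_t *szk_wcsstri(const wchar_t *s1, const wchar_t *s2)
{
    if (s1 == NULL || s2 == NULL) return NULL;
    if (*s2 == L'\0') return s1;  /* empty needle matches at the start, like wcsstr() */

    /* Keep folded characters in wint_t; a plain char would truncate them. */
    const wint_t first = towlower(*s2);

    for (const wchar_t *start = s1; *start != L'\0'; ++start)
    {
        /* Cheap rejection: skip positions whose first character cannot match. */
        if (towlower(*start) != first)
            continue;

        /* Compare the rest of the needle against the haystack at this position. */
        const wchar_t *h = start + 1;
        const wchar_t *n = s2 + 1;
        while (*n != L'\0' && *h != L'\0' && towlower(*h) == towlower(*n))
        {
            ++h;
            ++n;
        }

        if (*n == L'\0')
            return start;  /* reached the end of the needle: match */
    }
    return NULL;
}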

answered Oct 24 '22 02:10

Suzuki Keem

On Windows you can use the StrStrI function, which finds the first occurrence of a substring within a string. The comparison is not case-sensitive. Don't forget to include its header, Shlwapi.h, and to link against Shlwapi.lib. Check this out: http://msdn.microsoft.com/en-us/library/windows/desktop/bb773439(v=vs.85).aspx
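
One way to combine this with the `strcasestr()` suggestion from the other answers is a thin wrapper that picks whichever function the platform offers. This is only a sketch: `portable_stristr` is a hypothetical name, `StrStrIA` is the ANSI variant of `StrStrI`, and the two functions are not guaranteed to agree on edge cases such as an empty needle.

```c
#if !defined(_WIN32) && !defined(_GNU_SOURCE)
#define _GNU_SOURCE            /* may be needed for strcasestr() on glibc */
#endif

#include <string.h>

#ifdef _WIN32
#include <windows.h>
#include <shlwapi.h>           /* StrStrIA; link with Shlwapi.lib */
#endif

/* Hypothetical wrapper name; it is not part of any library. */
const char *portable_stristr(const char *haystack, const char *needle)
{
#ifdef _WIN32
    return StrStrIA(haystack, needle);    /* Windows shell API */
#else
    return strcasestr(haystack, needle);  /* GNU/BSD extension */
#endif
}
```

Callers then use `portable_stristr(haystack, needle)` everywhere and get a pointer to the match or NULL on both platforms.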

answered Oct 24 '22 02:10

Nitin Chhabra


Tags: c++, c, string, optimization, glibc
