
Optimizing several million char* to string conversions

I have an application that needs to take several million char* values as input (typically strings of fewer than 512 Unicode characters each) and convert and store them as .NET strings.

It is turning out to be a real bottleneck in my application's performance, and I'm wondering if there's a design pattern or idea that would make it more efficient.

There is a key detail that makes me feel this can be improved: there are a LOT of duplicates. If 1 million strings come in, there might be only around 50 unique char* patterns.

For the record, here is the routine I'm using to convert char* to String (this function is written in C++/CLI; the rest of the project is in C#):

String ^StringTools::MbCharToStr ( const char *Source ) 
{
   String ^str;

   if( (Source == NULL) || (Source[0] == '\0') )
   {
      str = gcnew String("");
   }
   else
   {
      // Find the number of UTF-16 characters (including the terminating
      // null) needed to hold the converted UTF-8 string.
      const int max_strsize = 2048;

      int wstr_size = MultiByteToWideChar (CP_UTF8, 0L, Source, -1, NULL, 0);
      if (wstr_size <= 0)
      {
         // The conversion failed (e.g. invalid UTF-8). Fall back to an
         // empty string rather than reading an uninitialized buffer.
         str = gcnew String("");
      }
      else if (wstr_size < max_strsize)
      {
         // Save the malloc/free overhead if it's a reasonable size.
         // Plus, KJN was having fits with exceptions within exception logging due
         // to a corrupted heap.

         wchar_t wstr[max_strsize];

         (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, wstr_size);
         str = gcnew String (wstr);
      }
      else
      {
         wchar_t *wstr = (wchar_t *)calloc (wstr_size, sizeof(wchar_t));
         if (wstr == NULL) 
            throw gcnew PCSException (__FILE__, __LINE__, PCS_INSUF_MEMORY, MSG_SEVERE);

         // Convert the UTF-8 string into the UTF-16 buffer, construct the
         // result String from the UTF-16 buffer, and then free the buffer.

         (void) MultiByteToWideChar (CP_UTF8, 0L, Source, -1, wstr, wstr_size);
         str = gcnew String ( wstr );
         free (wstr);
      }
   }
   return str;
}
asked Jan 14 '13 by greggorob64

2 Answers

You could use each byte of the input string to walk a trie structure. At the leaf, keep a single .NET string object. Then, when a char* comes in that you've seen previously, you can quickly find the existing .NET version without allocating any memory.

Pseudo-code:

  • start with an empty trie
  • process a char* by searching the trie until you can go no further
  • add nodes until your entire char* has been encoded as nodes
  • at the leaf, attach an actual .NET string

The answer to this other SO question should get you started: How to create a trie in c#
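
For illustration, here is a minimal C++/CLI sketch of the idea, assuming the trie is keyed on raw bytes and reusing the question's MbCharToStr for the one-time conversion (TrieNode and Intern are hypothetical names):

#include <vcclr.h>    // gcroot, for holding a managed handle in a native type
#include <cstring>

struct TrieNode
{
   TrieNode *children[256];             // one slot per possible byte value
   gcroot<System::String ^> cached;     // set once a string terminates here

   TrieNode() { std::memset (children, 0, sizeof children); }
};

System::String ^Intern (TrieNode *root, const char *source)
{
   TrieNode *node = root;
   for (const unsigned char *p = (const unsigned char *) source; *p; ++p)
   {
      if (node->children[*p] == NULL)
         node->children[*p] = new TrieNode();   // grow the trie as needed
      node = node->children[*p];
   }

   // First time we've seen this byte sequence: convert and cache it.
   if (static_cast<System::String ^>(node->cached) == nullptr)
      node->cached = StringTools::MbCharToStr (source);

   return node->cached;
}

With only ~50 unique patterns the trie stays tiny, but note that each 256-way node costs a couple of kilobytes, so for a larger key space you would want a sparser child representation (e.g. a small sorted array per node).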

answered by jimbo


From the question: "There is a key detail that makes me feel this can be improved: there are a LOT of duplicates. If 1 million strings come in, there might be only around 50 unique char* patterns."

If this is the case, you may want to consider storing the "found" patterns in a map, such as a std::map<const char*, gcroot<String^>> (though you'll need a comparator that compares the const char* contents rather than the pointer values), and using that to return the previously converted value.

There is some overhead to storing the map, doing the comparisons, and so on. However, this may be more than paid for by the dramatically reduced memory usage (you can reuse the managed string instances) and by the memory allocations you save (calloc/free). Also, using malloc instead of calloc would likely be a (very small) improvement, as there's no need to zero out the memory before MultiByteToWideChar overwrites it.
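
As a minimal sketch of that cache, again in C++/CLI and reusing the question's MbCharToStr (CStrLess and ConvertCached are hypothetical names):

#include <vcclr.h>    // gcroot
#include <cstring>
#include <map>

// Order keys by their contents, not their addresses.
struct CStrLess
{
   bool operator() (const char *a, const char *b) const
   {
      return std::strcmp (a, b) < 0;
   }
};

static std::map<const char *, gcroot<System::String ^>, CStrLess> g_cache;

System::String ^ConvertCached (const char *source)
{
   if (source == NULL || source[0] == '\0')
      return System::String::Empty;

   std::map<const char *, gcroot<System::String ^>, CStrLess>::iterator it =
      g_cache.find (source);
   if (it != g_cache.end())
      return it->second;                        // reuse the managed string

   System::String ^str = StringTools::MbCharToStr (source);

   // The key must outlive the map: copy it in case the caller's buffer
   // is transient. (_strdup allocates with malloc; never freed here.)
   g_cache[_strdup (source)] = str;
   return str;
}

With only ~50 entries the strcmp-based lookup is cheap; if the pattern count grew substantially, a hash map keyed on the string contents would avoid the O(log n) tree walk.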

answered by Reed Copsey