Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to replace characters in a array quickly

I am using a XML Text reader on a XML file that may contain characters that are invalid for the reader. My initial thought was to create my own version of the stream reader and clean out the bad characters but it is severely slowing down my program.

public class ClensingStream : StreamReader
{
        private static char[] badChars = { '\x00', '\x09', '\x0A', '\x10' };
    //snip
        public override int Read(char[] buffer, int index, int count)
        {
            var tmp = base.Read(buffer, index, count);

            for (int i = 0; i < buffer.Length; ++i)
            {
                //check the element in the buffer to see if it is one of the bad characters.
                if(badChars.Contains(buffer[i]))
                    buffer[i] = ' ';
            }

            return tmp;
        }
}

according to my profiler the code is spending 88% of its time in if(badChars.Contains(buffer[i])) what is the correct way to do this so I am not causing horrible slowness?

like image 327
Scott Chamberlain Avatar asked Dec 17 '22 16:12

Scott Chamberlain


2 Answers

The reason that it spends so much time in that line is because the Contains method loops through the array to look for the character.

Put the characters in a HashSet<char> instead:

private static HashSet<char> badChars =
  new HashSet<char>(new char[] { '\x00', '\x09', '\x0A', '\x10' });

The code to check if the set contains the character looks the same as when looking in the array, but it uses the hash code of the character to look for it instead of looping through all the items in the array.

Alternatively, you could put the characters in a switch, that way the compiler would create an efficient comparison:

switch (buffer[i]]) {
  case '\x00':
  case '\x09':
  case '\x0A':
  case '\x10': buffer[i] = ' '; break;
}

If you have more characters (five or six IIRC), the compiler will actually create a hash table to look up the cases, so that would be similar to using a HashSet.

like image 128
Guffa Avatar answered Dec 21 '22 22:12

Guffa


You might have better results with a switch statement:

switch (buffer[i])
{
    case '\x00':
    case '\x09':
    case '\x0A':
    case '\x10':
        buffer[i] = ' ';
        break;
}

This should be compiled down to fast code by the JIT compiler at runtime. Heck, the compiler might get close too. You don't need a method call this way either.

like image 43
Daren Thomas Avatar answered Dec 21 '22 23:12

Daren Thomas