
Strange string.IndexOf behaviour

I wrote the following snippet to get rid of excessive spaces in slabs of text:

int index = text.IndexOf("  ");
while (index > 0)
{
    text = text.Replace("  ", " ");
    index = text.IndexOf("  ");
}

Generally this works fine, albeit rather primitive and possibly inefficient.

Problem

When the text contains " - ", for some bizarre reason IndexOf returns a match! The Replace function doesn't remove anything, and the code is then stuck in an endless loop.

Any ideas what is going on with the string.IndexOf?

Andrew Harry asked Feb 04 '11 00:02



1 Answer

Ah, the joys of text.

What you most likely have there, but which got lost when posting on SO, is a "soft hyphen".

To reproduce the problem, I tried this code in LINQPad:

void Main()
{
    var text = "Test1 \u00ad Test2";
    int index = text.IndexOf("  ");
    while (index > 0)
    {
        text = text.Replace("  ", " ");
        index = text.IndexOf("  ");
    }
}

And sure enough, the above code just gets stuck in a loop.

Note that \u00ad is the Unicode code point for the soft hyphen, according to CharMap. You can also copy and paste the character from CharMap, but posting it here on SO replaces it with its much more common cousin, the hyphen-minus, U+002D (the one on your keyboard).
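If you suspect an invisible character like this, one way to check is to dump the code units of the suspicious text. The following is only a minimal sketch; the class and method names (CodePointDump, Describe) are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Hypothetical helper: renders each UTF-16 code unit of the input
    // as "U+XXXX", separated by single spaces.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

Calling CodePointDump.Describe(" \u00ad ") gives "U+0020 U+00AD U+0020", which makes the soft hyphen between the two spaces visible, whereas a real hyphen-minus would show up as U+002D.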

You can read a small section in the documentation for the String Class which has this to say on the subject:

String search methods, such as String.StartsWith and String.IndexOf, also can perform culture-sensitive or ordinal string comparisons. The following example illustrates the differences between ordinal and culture-sensitive comparisons using the IndexOf method. A culture-sensitive search in which the current culture is English (United States) considers the substring "oe" to match the ligature "œ". Because a soft hyphen (U+00AD) is a zero-width character, the search treats the soft hyphen as equivalent to Empty and finds a match at the beginning of the string. An ordinal search, on the other hand, does not find a match in either case.

The relevant part is the sentence about the soft hyphen. I also remember a blog post about this exact problem from a while back, but my Google-Fu is failing me tonight.

The problem here is that IndexOf and Replace use different methods for locating the text: IndexOf(string) defaults to a culture-sensitive comparison, while Replace(string, string) performs an ordinal one.

Whereas IndexOf treats the soft hyphen as "not really there", and so sees the spaces on either side of it as two adjacent spaces, Replace does not, and so removes neither of them. The loop condition therefore stays true, but since Replace never removes the spaces that satisfy it, the loop never ends. Undoubtedly other characters in Unicode cause similar problems, but this is the most typical case I've seen.
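To see the mismatch side by side, a small sketch along these lines may help. The class name SoftHyphenDemo and its members are hypothetical. The result of the culture-sensitive search depends on the runtime and the current culture, so no value is claimed for it here; the ordinal search is deterministic.

```csharp
using System;

public static class SoftHyphenDemo
{
    // Sample text from the repro: a soft hyphen (U+00AD) between two single spaces.
    public const string Sample = "Test1 \u00ad Test2";

    // Culture-sensitive search. It may treat the soft hyphen as ignorable
    // and report a match; the exact result depends on runtime and culture.
    public static int CultureIndex()
    {
        return Sample.IndexOf("  ", StringComparison.CurrentCulture);
    }

    // Ordinal search. It compares raw UTF-16 code units, so the soft hyphen
    // counts as a real character and no two adjacent spaces are found.
    public static int OrdinalIndex()
    {
        return Sample.IndexOf("  ", StringComparison.Ordinal);
    }

    // Replace(string, string) is ordinal as well, so it leaves the sample unchanged.
    public static bool ReplaceChangesSample()
    {
        return Sample.Replace("  ", " ") != Sample;
    }
}
```

OrdinalIndex() returns -1 and ReplaceChangesSample() returns false. If CultureIndex() returns a non-negative value on your setup, that is exactly the combination that keeps the original loop spinning.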

There are at least two ways of handling this:

  1. You can use Regex.Replace, which seems to not have this problem:

    text = Regex.Replace(text, "  +", " ");
    

    Personally I would probably use the whitespace special character in the Regular Expression, which is \s, but if you only want spaces, the above should do the trick.

  2. You can explicitly ask IndexOf to use an ordinal comparison, which won't get tripped up by text behaving like ... well ... text:

    index = text.IndexOf("  ", StringComparison.Ordinal);
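
Putting the second option together, the original loop could be sketched as below. This is one possible arrangement rather than the only correct one, and the names SpaceCollapser and CollapseSpaces are invented for this example. It also checks for >= 0 instead of > 0, so that a double space at the very start of the text is collapsed too.

```csharp
using System;

public static class SpaceCollapser
{
    // Hypothetical helper: collapses runs of spaces into a single space.
    public static string CollapseSpaces(string text)
    {
        // The ordinal search agrees with string.Replace, which is also ordinal,
        // so every match found here is one that Replace will actually remove.
        while (text.IndexOf("  ", StringComparison.Ordinal) >= 0)
        {
            text = text.Replace("  ", " ");
        }
        return text;
    }
}
```

With this version, CollapseSpaces("a    b") returns "a b", and the soft hyphen sample "Test1 \u00ad Test2" comes back unchanged instead of looping forever.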
    
Lasse V. Karlsen answered Sep 28 '22 06:09
