Error with IndexOf C#

Tags: c#, .net

I'm obviously missing something here..

I'm writing a function that returns the number of substrings delimited by a particular string. Here is the rather simple function -

public static FuncError DCount(String v1, String v2, ref Int32 result) {
    result = 0;
    if (String.IsNullOrEmpty(v1)) {
        return null;
    }
    if (String.IsNullOrEmpty(v2)) {
        return null;
    }

    int ct = 1;
    int ix = 0;
    int nix = 0;

    do {
        nix = v1.IndexOf(v2, ix);
        if (nix >= 0) {
            ct++;

            System.Diagnostics.Debug.Print(
                string.Format("{0} found at {1} count={2} result = {3}",
                    v2, nix, ct, v1.Substring(nix, 1)));
            ix = nix + v2.Length;
        }
    } while (nix >= 0);
    result = ct;
    return null;
}

The problem comes when I call it with a special character that is used as a separator in one particular situation: it returns lots of false positives. In the Debug.Print output below, the first value (the string searched for) and the last value (the character found at that index) should always be the same.

þ found at 105 count=2 result = t
þ found at 136 count=3 result = t
þ found at 152 count=4 result = þ
þ found at 249 count=5 result = t
þ found at 265 count=6 result = t
þ found at 287 count=7 result = t
þ found at 317 count=8 result = t
þ found at 333 count=9 result = þ
þ found at 443 count=10 result = þ
þ found at 553 count=11 result = þ
þ found at 663 count=12 result = þ
þ found at 773 count=13 result = þ
þ found at 883 count=14 result = þ
þ found at 993 count=15 result = þ

If I pass the þ as a char it works fine. If I split the string using þ as a delimiter it returns the correct number of elements. As for the incorrectly identified 't', there are other 't's in the results that are not being picked up, so it's not a character conversion issue.

Confused ...

Thanks

asked Jan 13 '23 22:01 by baffled


2 Answers

The problem here is how different cultures represent characters, and in some cases combine them.

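The letter you're searching for, thorn (þ), is apparently treated as equivalent to the letter pair "th" under culture-sensitive comparison, and IndexOf with a string argument compares that way by default, using the current culture. That is why the false positives in your output land on a 't': it is the first character of a matched "th".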

Try this code in LINQPad:

void Main()
{
    string x = "uma thurman";
    x.IndexOf("þ").Dump();
}

It will output 4.

(Note that I ran this program on a machine in Norway, which may or may not affect the results.)

This is the same "problem" as with the German letter for double S, ß, which in some cultures can be found in words that contain two s's together.
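This would also explain why passing þ as a char works: the char overload of IndexOf does an ordinal search, while the string overload uses the current culture. A minimal sketch of the contrast follows (the class name ThornDemo is made up for illustration). No output is given for the string overload, since it depends on the runtime and the current culture.

```csharp
using System;

public static class ThornDemo
{
    public static void Main()
    {
        string x = "uma thurman";

        // String overload: culture-sensitive search using the current culture.
        // The answer above reports 4 here; other runtimes or cultures may differ.
        Console.WriteLine(x.IndexOf("þ"));

        // Char overload: ordinal search, so only a literal þ character matches.
        Console.WriteLine(x.IndexOf('þ')); // -1, the string contains no þ
    }
}
```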

answered Jan 25 '23 04:01 by Lasse V. Karlsen


You can use StringComparison.Ordinal to get culture-agnostic string matching.

Using Lasse V. Karlsen's example:

string x = "uma thurman";
x.IndexOf("þ", StringComparison.Ordinal).Dump();

Will result in -1.

See Best Practices for Using Strings in the .NET Framework for more information.
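Applied to the counting loop from the question, the same idea might look like the sketch below. CountFields and DelimiterCounter are hypothetical names, and the signature is simplified to return the count directly rather than using the question's FuncError and ref parameter. The only substantive change is the StringComparison.Ordinal argument.

```csharp
using System;

public static class DelimiterCounter
{
    // Hypothetical helper, not the asker's original signature.
    // Returns the number of fields in text separated by delimiter,
    // or 0 if either argument is null or empty (as the original does).
    public static int CountFields(string text, string delimiter)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(delimiter))
        {
            return 0;
        }

        int count = 1; // n delimiters separate n + 1 fields
        int start = 0;
        while (true)
        {
            // Ordinal comparison: "þ" no longer matches "th".
            int found = text.IndexOf(delimiter, start, StringComparison.Ordinal);
            if (found < 0)
            {
                break;
            }
            count++;
            start = found + delimiter.Length;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountFields("uma thurman", "þ")); // 1
        Console.WriteLine(CountFields("aþbþc", "þ"));       // 3
    }
}
```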

answered Jan 25 '23 06:01 by Dustin Kingen