string.IndexOf() not recognizing modified characters

Question

When using IndexOf to find a character that is followed by a high-valued character (e.g. char 700, which is ʼ), IndexOf fails to recognize the character you are looking for.

e.g.

string find = "abcʼabcabc";   
int index = find.IndexOf("c");

In this code, index should be 2, but it returns 6.

Is there a way to get around this?

Mark Sowul · Accepted Answer

Unicode character 700 (U+02BC) is a modifier letter apostrophe: in other words, it modifies the letter c. In the same way, if you were to use an 'e' followed by character 769 (0x301), it would not really be an 'e' anymore: the e has been modified to be e with an acute accent. To wit: é. You'll see that letter is actually two characters: copy it to Notepad and hit backspace (neat, huh?).

You need to do an "Ordinal" comparison, which compares the raw character values one by one without applying any linguistic rules. That will find the 'c' and ignore the linguistic fact that it is modified by the next character. In my 'e' example, the character values are (101)(769), so if you search value by value for 101 you will find it, and that ignores the fact that (101)(769) is linguistically the same as (233): é. If you search for (233) linguistically, it will find the "equivalent" (101)(769):

string find = "abe\u0301abcabc"; // 'e' followed by the combining acute accent (char 769)
int index = find.IndexOf("\u00E9"); // gives you 2, even though the match in "find" is two chars and the search string (é, char 233) is one
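
To make the two-character point concrete, here is a minimal sketch that spells the accent out as an explicit escape. The class and member names are made up for illustration. Only the ordinal results are annotated, since those are fixed by the character values; the culture-sensitive result can depend on the runtime and culture, so it is printed without a stated value.

```csharp
using System;

public static class CombiningAccentDemo
{
    // 'e' (U+0065, decimal 101) followed by COMBINING ACUTE ACCENT (U+0301, decimal 769):
    // two chars in the string, one visible letter on screen.
    public const string Decomposed = "abe\u0301abcabc";

    // The single precomposed character é (U+00E9, decimal 233).
    public const string Precomposed = "\u00E9";

    // Ordinal search compares raw char values, so the bare 'e' is found.
    public static int OrdinalIndexOfE() =>
        Decomposed.IndexOf("e", StringComparison.Ordinal);

    // Ordinal search for the precomposed é fails: U+00E9 never occurs in the string.
    public static int OrdinalIndexOfPrecomposed() =>
        Decomposed.IndexOf(Precomposed, StringComparison.Ordinal);

    public static void Main()
    {
        Console.WriteLine(Decomposed.Length);            // 10, not 9
        Console.WriteLine(OrdinalIndexOfE());            // 2
        Console.WriteLine(OrdinalIndexOfPrecomposed());  // -1

        // Culture-sensitive (linguistic) search; the answer above reports 2 for this case.
        Console.WriteLine(Decomposed.IndexOf(Precomposed, StringComparison.CurrentCulture));
    }
}
```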

Hopefully that's not too confusing. If you're doing this in real code, you should explain in comments exactly what you're doing. As in my 'e' example, you would generally want linguistic equivalence for user data and ordinal equivalence for things like constants (which hopefully wouldn't differ like this, lest your successor hunt you down with an axe).

Loofer · Answer

The cʼ construct is being treated as linguistically different from a plain c. Use the Ordinal string comparison to force a comparison of the raw character values.

string find = "abcʼabcabc";
int index = find.IndexOf("c", StringComparison.Ordinal);
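
As a rough sketch of the fix applied to the string from the question, with the apostrophe written as an explicit escape so it cannot be lost in copy and paste. The class and helper names are hypothetical. It also shows the char overload of IndexOf, which is documented as an ordinal search and so sidesteps the problem without an extra argument.

```csharp
using System;

public static class ModifierApostropheDemo
{
    // "abc" + MODIFIER LETTER APOSTROPHE (U+02BC, decimal 700) + "abcabc"
    public const string Text = "abc\u02BCabcabc";

    // String overload with an explicit ordinal comparison.
    public static int FindOrdinal(string haystack, string needle) =>
        haystack.IndexOf(needle, StringComparison.Ordinal);

    // Char overload: always an ordinal search.
    public static int FindChar(string haystack, char needle) =>
        haystack.IndexOf(needle);

    public static void Main()
    {
        Console.WriteLine(FindOrdinal(Text, "c"));  // 2
        Console.WriteLine(FindChar(Text, 'c'));     // 2

        // The one-argument string overload is culture-sensitive; the question reports 6 here.
        Console.WriteLine(Text.IndexOf("c"));
    }
}
```

Passing the StringComparison explicitly at every call site also documents the intent, which fits the advice above about explaining which kind of equivalence you want.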

Tags:

c#

indexof