Length of substring matched by culture-sensitive String.IndexOf method

I tried writing a culture-aware string replacement method:

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    return index >= 0
        ? text.Substring(0, index) + newValue + text.Substring(index + oldValue.Length)
        : text;
}

However, it chokes on Unicode combining characters:

// \u0301 is Combining Acute Accent
Console.WriteLine(Replace("déf", "é", "o"));       // 1. CORRECT: dof
Console.WriteLine(Replace("déf", "e\u0301", "o")); // 2. INCORRECT: do
Console.WriteLine(Replace("de\u0301f", "é", "o")); // 3. INCORRECT: dóf

To fix my code, I need to know that in the second example, String.IndexOf matched only one character (é) even though it searched for two (e\u0301). Similarly, I need to know that in the third example, String.IndexOf matched two characters (e\u0301) even though it only searched for one (é).

How can I determine the actual length of the substring matched by String.IndexOf?

NOTE: Performing Unicode normalization on text and oldValue (as suggested by James Keesey) would accommodate combining characters, but ligatures would still be a problem:

Console.WriteLine(Replace("œf", "œ", "i"));  // 4. CORRECT: if
Console.WriteLine(Replace("œf", "oe", "i")); // 5. INCORRECT: i
Console.WriteLine(Replace("oef", "œ", "i")); // 6. INCORRECT: ief
Michael Liu asked Dec 09 '13 20:12


2 Answers

You will need to call FindNLSString or FindNLSStringEx directly yourself. String.IndexOf uses FindNLSStringEx internally, but all the information you need is also available from FindNLSString, which reports the length of the matched substring through its last parameter.

Here is an example of how to rewrite your Replace method so that it works against your test cases. Note that I am using the current user locale; read the API documentation if you want to use the system locale or provide your own. I am also passing 0 for the flags, which means it will use the default string comparison options for the locale; again, the documentation can help you provide different options.

// Requires: using System.Runtime.InteropServices;

// Locale identifier for the current user's default locale.
public const int LOCALE_USER_DEFAULT = 0x0400;

// Returns the index of the match in sourceString, or -1 if nothing is found.
// "found" receives the length of the matched substring.
[DllImport("kernel32.dll", SetLastError = true, ExactSpelling = true)]
internal static extern int FindNLSString(int locale, uint flags, [MarshalAs(UnmanagedType.LPWStr)] string sourceString, int sourceCount, [MarshalAs(UnmanagedType.LPWStr)] string findString, int findCount, out int found);

public static string ReplaceWithCombiningCharSupport(string text, string oldValue, string newValue)
{
    int foundLength;
    int index = FindNLSString(LOCALE_USER_DEFAULT, 0, text, text.Length, oldValue, oldValue.Length, out foundLength);
    // Skip foundLength characters (the actual match), not oldValue.Length.
    return index >= 0 ? text.Substring(0, index) + newValue + text.Substring(index + foundLength) : text;
}
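
As an aside that goes beyond the P/Invoke approach: on .NET 5 or later (an assumption about your target framework), CompareInfo.IndexOf has a span-based overload that reports the matched length through an out parameter, so a managed version may be possible. The class name CultureReplace below is made up for illustration, and which ligature cases match will depend on the platform's collation data, so treat this as a sketch rather than a drop-in fix.

```csharp
using System;
using System.Globalization;

public static class CultureReplace
{
    // Replaces the first culture-sensitive match of oldValue in text.
    // Assumes .NET 5+, where this IndexOf overload exposes matchLength.
    public static string Replace(string text, string oldValue, string newValue)
    {
        CompareInfo compareInfo = CultureInfo.CurrentCulture.CompareInfo;
        int index = compareInfo.IndexOf(
            text.AsSpan(), oldValue.AsSpan(), CompareOptions.None, out int matchLength);

        // matchLength is the number of chars actually matched in text,
        // which can differ from oldValue.Length.
        return index >= 0
            ? text.Substring(0, index) + newValue + text.Substring(index + matchLength)
            : text;
    }
}
```

Called as CultureReplace.Replace("de\u0301f", "\u00E9", "o"), this should remove both the base letter and the combining accent instead of leaving the accent behind.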
David Ewen answered Oct 24 '22 09:10



I spoke too soon (and had never seen this method before), but there is an alternative. You can use the StringInfo.ParseCombiningCharacters() method to get the start index of each actual character (text element) and use that to determine the length of the string to replace.
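
One way this could be put together, as a rough sketch: take the index from IndexOf, then try each text element boundary after it as a candidate end and keep the shortest substring that compares equal to oldValue under the current culture. The helper MatchLength and the class name TextElementReplace are hypothetical names introduced here, and picking the shortest match is my own assumption about what "the length of the string to replace" should mean.

```csharp
using System;
using System.Globalization;

public static class TextElementReplace
{
    // Hypothetical helper: length of the shortest run of whole text elements
    // starting at index that compares equal to value, or -1 if none does.
    private static int MatchLength(string text, int index, string value)
    {
        int[] starts = StringInfo.ParseCombiningCharacters(text);
        for (int i = 0; i <= starts.Length; i++)
        {
            // Candidate ends are the text element boundaries plus the end of the string.
            int end = i < starts.Length ? starts[i] : text.Length;
            if (end <= index) continue;

            string candidate = text.Substring(index, end - index);
            if (string.Compare(candidate, value, StringComparison.CurrentCulture) == 0)
                return end - index;
        }
        return -1;
    }

    public static string Replace(string text, string oldValue, string newValue)
    {
        int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
        if (index < 0) return text;

        int length = MatchLength(text, index, oldValue);
        if (length < 0)
        {
            // Fall back to the original behavior, clamped to the remaining text.
            length = Math.Min(oldValue.Length, text.Length - index);
        }
        return text.Substring(0, index) + newValue + text.Substring(index + length);
    }
}
```

This costs an extra comparison per candidate boundary, so it is probably only reasonable for short strings or occasional replacements.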


You will need to normalize both strings before you call IndexOf. This makes sure that equivalent character sequences in the source and target strings have the same length.

See the String.Normalize() reference page which describes this exact problem.
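
A minimal sketch of that idea, assuming normalization form C is acceptable for both inputs (the class name NormalizedReplace is made up for illustration). Note that when a match is found the returned string is built from the normalized text, so other combining sequences in it are composed as a side effect.

```csharp
using System;
using System.Text;

public static class NormalizedReplace
{
    // Replaces the first occurrence of oldValue in text after normalizing both to NFC.
    public static string Replace(string text, string oldValue, string newValue)
    {
        string normalizedText = text.Normalize(NormalizationForm.FormC);
        string normalizedOld = oldValue.Normalize(NormalizationForm.FormC);

        int index = normalizedText.IndexOf(normalizedOld, StringComparison.CurrentCulture);
        if (index < 0) return text; // no match: hand back the input unchanged

        // After normalization, canonically equivalent sequences have equal length,
        // so normalizedOld.Length is the length of the match for those cases.
        return normalizedText.Substring(0, index)
            + newValue
            + normalizedText.Substring(index + normalizedOld.Length);
    }
}
```

This handles the combining accent cases, but it does not by itself help when the culture-sensitive comparison matches strings that are not canonically equivalent.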

James Keesey answered Oct 24 '22 08:10
