
How can this method to convert a name to proper case be improved?

I am writing a basic function to convert millions of names, in a one-time batch process, from their current uppercase form to a proper mixed case. I came up with the following function:

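using System;
using System.Globalization;

public string ConvertToProperNameCase(string input)
{
    // Lowercase first: ToTitleCase leaves all-uppercase words unchanged (it treats them as acronyms).
    char[] chars = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(input.ToLower()).ToCharArray();

    // Capitalize the letter that follows an apostrophe or a hyphen (O'Brian, Doe-Smith).
    for (int i = 0; i + 1 < chars.Length; i++)
    {
        if (chars[i] == '\'' || chars[i] == '-')
        {
            chars[i + 1] = Char.ToUpper(chars[i + 1]);
        }
    }
    return new string(chars);
}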

It works in most cases such as:

  1. JOHN SMITH → John Smith
  2. SMITH, JOHN T → Smith, John T
  3. JOHN O'BRIAN → John O'Brian
  4. JOHN DOE-SMITH → John Doe-Smith

There are some edge cases that do not work:

  1. JASON MCDONALD → Jason Mcdonald (Correct: Jason McDonald)
  2. OSCAR DE LA HOYA → Oscar De La Hoya (Correct: Oscar de la Hoya)
  3. MARIE DIFRANCO → Marie Difranco (Correct: Marie DiFranco)

These are not captured, and I am not sure I can handle all of these odd edge cases. How can I change or extend the function to capture more of them? I am sure there are plenty of edge cases I have not even thought of. All casing should follow North American conventions, meaning that if another country expects a different capitalization format, the North American format takes precedence.

Kelsey asked Apr 30 '10 16:04


2 Answers

I think you'll run into a wall here, because usually you won't be able to judge correctly whether a conversion is reasonable or not.

Consider one of your edge cases:

JASON MCDONALD → Jason Mcdonald (Correct: Jason McDonald)

You could simply check for Mc at the beginning of the name and then apply your correction, right? But what if your person is named Mcizck (I made that up, of course), which should not be corrected to McIzck but left as is?
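
To make that trade-off concrete, here is a minimal sketch of such a prefix rule with an exception list. The class name `McPrefixRule`, the method `Apply`, and the contents of the exception set are all made up for illustration; it assumes each word has already been title-cased by the function in the question.

```csharp
using System;
using System.Collections.Generic;

public static class McPrefixRule
{
    // Hypothetical exception list: surnames that start with "Mc" but must keep
    // a lowercase third letter. "Mcizck" is the made-up name from the text above.
    private static readonly HashSet<string> Exceptions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Mcizck" };

    // Expects a single word that is already title-cased, e.g. "Mcdonald".
    public static string Apply(string word)
    {
        if (string.IsNullOrEmpty(word) || word.Length < 3) return word;
        if (!word.StartsWith("Mc", StringComparison.Ordinal)) return word;
        if (Exceptions.Contains(word)) return word;

        // Capitalize the letter after the "Mc" prefix: "Mcdonald" -> "McDonald".
        return "Mc" + char.ToUpperInvariant(word[2]) + word.Substring(3);
    }
}
```

With this sketch, `McPrefixRule.Apply("Mcdonald")` returns `McDonald`, while `McPrefixRule.Apply("Mcizck")` is returned unchanged. The catch is that someone has to know about every exception in advance, which is exactly the judgment the code cannot make on its own.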

There is no 100% perfect solution to this problem. What you have here is a natural language problem, and those are really difficult to solve, especially for a computer. Cultures are too different to be modeled correctly. Even if you say North American conventions take precedence, you'll have a high percentage of "false positives". Our society consists of a huge mix of cultures; it is simply not adequate to say "North American takes precedence".

Without handling the edge cases, I guess your current solution will work 99% of the time. All further edge cases should be corrected manually if 100% correct names are really required.
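
If manual correction is the fallback, one way to keep it manageable is to flag only the converted names that look suspicious and send those to a person. The sketch below is an assumption on my part, not a complete rule set: `ReviewFlagger`, `NeedsReview`, and both lists are hypothetical and are seeded only from the edge cases in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReviewFlagger
{
    // Hypothetical, deliberately incomplete lists based on the question's edge cases.
    private static readonly string[] SuspectPrefixes = { "Mc", "Mac", "Di" };
    private static readonly HashSet<string> SuspectParticles =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "de", "la", "van", "von" };

    // Returns true when any word in the converted name looks like one the
    // automatic conversion tends to get wrong, so a person should check it.
    public static bool NeedsReview(string convertedName)
    {
        if (string.IsNullOrWhiteSpace(convertedName)) return false;

        string[] words = convertedName.Split(
            new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);

        return words.Any(w =>
            SuspectParticles.Contains(w) ||
            SuspectPrefixes.Any(p => w.Length > p.Length &&
                                     w.StartsWith(p, StringComparison.Ordinal)));
    }
}
```

`NeedsReview("Jason Mcdonald")` and `NeedsReview("Oscar De La Hoya")` return true, and `NeedsReview("John Smith")` returns false. The check over-flags on purpose (a first name such as "Diane" would also be queued), because a false alarm costs a reviewer a few seconds while a missed name stays wrong.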

Johannes Rudolph answered Nov 07 '22 23:11


I hope the reason you're doing this conversion is that the software is changing to let users enter their names with the correct casing in the first place.

That said, the only dependable solution would be to notify the users that you have changed the representation of their name. They can then edit the casing if it is incorrect. (You could call them, email them, wait until they use your software the next time, etc.)

If you can't let the users update their own names, the second most dependable method would be to collect lists of (last) names from public sources. If you can find enough of these, you should be able to cover more of the edge cases - simply see if the name exists in your properly-cased list, then use that casing.
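
A minimal sketch of that lookup, assuming the collected names end up in a case-insensitive dictionary. `KnownNameLookup`, `Resolve`, and the two sample entries are hypothetical; a real list would be loaded from whatever sources you manage to collect, and the `fallback` delegate stands in for your existing conversion.

```csharp
using System;
using System.Collections.Generic;

public static class KnownNameLookup
{
    // Hypothetical sample entries; keys are matched case-insensitively,
    // values hold the casing taken from the collected list.
    private static readonly Dictionary<string, string> KnownCasing =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "MCDONALD", "McDonald" },
            { "DIFRANCO", "DiFranco" },
        };

    // Returns the casing from the list when the word is known; otherwise
    // falls back to the caller-supplied automatic conversion.
    public static string Resolve(string word, Func<string, string> fallback)
    {
        if (word == null) throw new ArgumentNullException(nameof(word));
        if (fallback == null) throw new ArgumentNullException(nameof(fallback));

        return KnownCasing.TryGetValue(word, out string known) ? known : fallback(word);
    }
}
```

Calling `KnownNameLookup.Resolve("MCDONALD", ConvertToProperNameCase)` would return `McDonald` from the list, and any word not in the list goes through the existing function unchanged. Doing the lookup per word rather than per full name keeps the list small, at the cost of not distinguishing two people who case the same surname differently.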

John Fisher answered Nov 08 '22 00:11