
Can .NET convert Unicode to ASCII to remove "smart quotes", etc?

Some of our users use e-mail clients that can't cope with Unicode, even when the encoding, etc. are properly set in the mail headers.

I'd like to 'normalise' the content they're receiving. The biggest problem we have is users copy'n'pasting content from Microsoft Word into our web application, which then forwards that content by e-mail - including fractions, smart quotes, and all the other extended Unicode characters that Word helpfully inserts for you.

I'm guessing there is no definitive solution for this, but before I sit down and start writing great big lookup tables, is there some built-in method that'll get me started?

There are basically three phases involved.

First, stripping accents from otherwise-normal letters - solution to this is here

This paragraph contains “smart quotes” and áccénts and ½ of the problem is fractions

goes to

This paragraph contains “smart quotes” and accents and ½ of the problem is fractions
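One common way to handle this first phase in .NET is to decompose the string with Normalize(NormalizationForm.FormD) and then drop the combining marks. The sketch below is not necessarily the linked solution; the class and method names are assumptions made for illustration.

```csharp
using System.Globalization;
using System.Text;

public static class AccentStripper
{
    // Hypothetical helper: split each accented letter into a base letter
    // plus combining marks (Form D), then discard the combining marks.
    public static string StripAccents(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        // Recompose whatever is left so the result is back in Form C.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Letters such as ø, æ and ß have no canonical decomposition, so they pass through this step unchanged and still need a lookup table.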

Second, replacing single Unicode characters with their ASCII equivalent, to give:

This paragraph contains "smart quotes" and accents and ½ of the problem is fractions

This is the part where I'm hoping there's a solution before I implement my own.

Finally, replacing specific characters with a suitable ASCII sequence - ½ to 1/2, and so on - which I'm pretty sure isn't natively supported by any kind of Unicode magic, but somebody might have written a suitable lookup table I can re-use.
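As a hedged aside on that last phase: Unicode compatibility decomposition (NormalizationForm.FormKD) does expand some of these characters, for example ½ into 1, a fraction slash (U+2044) and 2. The sketch below assumes normalization data is available on the platform; the class and method names are made up for this example, and smart quotes are not affected by it.

```csharp
using System.Text;

public static class CompatibilityFolder
{
    // Hypothetical helper: Form KD expands "compatibility" characters,
    // e.g. U+00BD becomes '1', U+2044 (fraction slash), '2'.
    public static string FoldCompatibility(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormKD);
        // The fraction slash is not ASCII, so swap it for a plain '/'.
        return decomposed.Replace('\u2044', '/');
    }
}
```

Form KD also splits accented letters into a base letter plus combining marks, so it would normally be combined with the accent-stripping phase. Characters with no decomposition, such as curly quotes and dashes, still need a lookup table.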

Any ideas?

Dylan Beattie asked May 28 '11 18:05


2 Answers

Thank you all for some very useful answers. I realize the actual question isn't "How can I convert ANY Unicode character into its ASCII fallback?" - the question is "How can I convert the Unicode characters my customers are complaining about into their ASCII fallbacks?"

In other words - we don't need a general-purpose solution; we need a solution that'll work 99% of the time, for English-speaking customers pasting English-language content from Word and other websites into our application. To that end, I analyzed eight years' worth of messages sent through our system looking for characters that aren't representable in ASCII encoding, using this test:

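///<summary>Determine whether the supplied character is representable
///using ASCII encoding.</summary>
bool IsAscii(char inputChar) {
    // Requires: using System.Text;
    // ASCIIEncoding substitutes '?' for anything outside ASCII, so a
    // character that survives the round trip unchanged is ASCII.
    var ascii = new ASCIIEncoding();
    var asciiChar = (char)(ascii.GetBytes(inputChar.ToString())[0]);
    return(asciiChar == inputChar);
}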
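The analysis itself could be sketched along these lines. This is an assumption about how such a tally might be written, not the code that was actually run; NonAsciiTally and CountNonAscii are hypothetical names, and the range check is a cheaper equivalent of the encoding round trip.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class NonAsciiTally
{
    // ASCII covers code points 0 through 127, so a range check gives the
    // same answer as round-tripping the character through ASCIIEncoding.
    public static bool IsAscii(char c)
    {
        return c <= '\u007F';
    }

    // Hypothetical helper: count every non-ASCII character across a set
    // of messages and return the counts, most frequent first.
    public static List<KeyValuePair<char, int>> CountNonAscii(IEnumerable<string> messages)
    {
        var counts = new Dictionary<char, int>();
        foreach (string message in messages)
        {
            foreach (char c in message)
            {
                if (IsAscii(c)) continue;
                int current;
                counts.TryGetValue(c, out current); // leaves 0 when the key is absent
                counts[c] = current + 1;
            }
        }
        return counts.OrderByDescending(pair => pair.Value).ToList();
    }
}
```

Sorting by frequency makes it easy to see which characters are worth a hand-picked replacement and which are one-off oddities.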

I've then been through the resulting set of unrepresentable characters and manually assigned an appropriate replacement string. The whole lot is bundled up in an extension method, so you can call myString.Asciify() to convert your string into a reasonable ASCII-encoding approximation.

using System;
using System.Collections.Generic;
using System.Linq;

public static class StringExtensions {
    private static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>();
    /// <summary>Returns the specified string with characters not representable in ASCII converted to a suitable representative equivalent.  Yes, this is lossy.</summary>
    /// <param name="s">A string.</param>
    /// <returns>The supplied string, with smart quotes, fractions, accents and punctuation marks 'normalized' to ASCII equivalents.</returns>
    /// <remarks>This method is lossy. It's a bit of a hack that we use to get clean ASCII text for sending to downlevel e-mail clients.</remarks>
    public static string Asciify(this string s) {
        return (String.Join(String.Empty, s.Select(c => Asciify(c)).ToArray()));
    }

    private static string Asciify(char x) {
        // Characters without an entry in the table pass through unchanged.
        string replacement;
        return Replacements.TryGetValue(x, out replacement) ? replacement : x.ToString();
    }

    static StringExtensions() {
        Replacements['’'] = "'"; // 75151 occurrences
        Replacements['–'] = "-"; // 23018 occurrences
        Replacements['‘'] = "'"; // 9783 occurrences
        Replacements['”'] = "\""; // 6938 occurrences
        Replacements['“'] = "\""; // 6165 occurrences
        Replacements['…'] = "..."; // 5547 occurrences
        Replacements['£'] = "GBP"; // 3993 occurrences
        Replacements['•'] = "*"; // 2371 occurrences
        Replacements['\u00A0'] = " "; // 1529 occurrences (presumably a non-breaking space, U+00A0)
        Replacements['é'] = "e"; // 878 occurrences
        Replacements['ï'] = "i"; // 328 occurrences
        Replacements['´'] = "'"; // 226 occurrences
        Replacements['—'] = "-"; // 133 occurrences
        Replacements['·'] = "*"; // 132 occurrences
        Replacements['„'] = "\""; // 102 occurrences
        Replacements['€'] = "EUR"; // 95 occurrences
        Replacements['®'] = "(R)"; // 91 occurrences
        Replacements['¹'] = "(1)"; // 80 occurrences
        Replacements['«'] = "\""; // 79 occurrences
        Replacements['è'] = "e"; // 79 occurrences
        Replacements['á'] = "a"; // 55 occurrences
        Replacements['™'] = "TM"; // 54 occurrences
        Replacements['»'] = "\""; // 52 occurrences
        Replacements['ç'] = "c"; // 52 occurrences
        Replacements['½'] = "1/2"; // 48 occurrences
        Replacements['\u00AD'] = "-"; // 39 occurrences (soft hyphen, U+00AD)
        Replacements['°'] = " degrees "; // 33 occurrences
        Replacements['ä'] = "a"; // 33 occurrences
        Replacements['É'] = "E"; // 31 occurrences
        Replacements['‚'] = ","; // 31 occurrences
        Replacements['ü'] = "u"; // 30 occurrences
        Replacements['í'] = "i"; // 28 occurrences
        Replacements['ë'] = "e"; // 26 occurrences
        Replacements['ö'] = "o"; // 19 occurrences
        Replacements['à'] = "a"; // 19 occurrences
        Replacements['¬'] = " "; // 17 occurrences
        Replacements['ó'] = "o"; // 15 occurrences
        Replacements['â'] = "a"; // 13 occurrences
        Replacements['ñ'] = "n"; // 13 occurrences
        Replacements['ô'] = "o"; // 10 occurrences
        Replacements['¨'] = ""; // 10 occurrences
        Replacements['å'] = "a"; // 8 occurrences
        Replacements['ã'] = "a"; // 8 occurrences
        Replacements['ˆ'] = ""; // 8 occurrences
        Replacements['©'] = "(c)"; // 6 occurrences
        Replacements['Ä'] = "A"; // 6 occurrences
        Replacements['Ï'] = "I"; // 5 occurrences
        Replacements['ò'] = "o"; // 5 occurrences
        Replacements['ê'] = "e"; // 5 occurrences
        Replacements['î'] = "i"; // 5 occurrences
        Replacements['Ü'] = "U"; // 5 occurrences
        Replacements['Á'] = "A"; // 5 occurrences
        Replacements['ß'] = "ss"; // 4 occurrences
        Replacements['¾'] = "3/4"; // 4 occurrences
        Replacements['È'] = "E"; // 4 occurrences
        Replacements['¼'] = "1/4"; // 3 occurrences
        Replacements['†'] = "+"; // 3 occurrences
        Replacements['³'] = "'"; // 3 occurrences
        Replacements['²'] = "'"; // 3 occurrences
        Replacements['Ø'] = "O"; // 2 occurrences
        Replacements['¸'] = ","; // 2 occurrences
        Replacements['Ë'] = "E"; // 2 occurrences
        Replacements['ú'] = "u"; // 2 occurrences
        Replacements['Ö'] = "O"; // 2 occurrences
        Replacements['û'] = "u"; // 2 occurrences
        Replacements['Ú'] = "U"; // 2 occurrences
        Replacements['Œ'] = "Oe"; // 2 occurrences
        Replacements['º'] = "?"; // 1 occurrences
        Replacements['‰'] = "0/00"; // 1 occurrences
        Replacements['Å'] = "A"; // 1 occurrences
        Replacements['ø'] = "o"; // 1 occurrences
        Replacements['˜'] = "~"; // 1 occurrences
        Replacements['æ'] = "ae"; // 1 occurrences
        Replacements['ù'] = "u"; // 1 occurrences
        Replacements['‹'] = "<"; // 1 occurrences
        Replacements['±'] = "+/-"; // 1 occurrences
    }
}
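Asciify leaves any character without a table entry untouched, so output can still contain non-ASCII text. If guaranteed ASCII output matters, one possible variant (a sketch, not part of the code above) falls back to '?' for anything unmapped. The class name, method name and abbreviated table here are assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.Text;

public static class AsciiFallback
{
    // Abbreviated, illustrative table; the full list is in StringExtensions.
    private static readonly Dictionary<char, string> Sample = new Dictionary<char, string>
    {
        { '\u2019', "'" },   // right single quote
        { '\u201C', "\"" },  // left double quote
        { '\u201D', "\"" },  // right double quote
        { '\u00BD', "1/2" }, // one half
    };

    // Hypothetical helper: use the table where it has an entry, keep
    // plain ASCII as-is, and substitute '?' for everything else.
    public static string ToAsciiOrQuestionMark(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            string replacement;
            if (Sample.TryGetValue(c, out replacement)) sb.Append(replacement);
            else if (c <= '\u007F') sb.Append(c);
            else sb.Append('?');
        }
        return sb.ToString();
    }
}
```

Whether a visible '?' is better than passing the original character through depends on how badly the receiving mail client handles it.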

Note that there are some rather odd fallbacks in there - like this one:

Replacements['³'] = "'"; // 3 occurrences
Replacements['²'] = "'"; // 3 occurrences

That's because one of our users has some program that converts open/close smart quotes into ² and ³ (as in: he said ²hello³), and nobody has ever used them to represent exponentiation. This will probably work quite nicely for us, but YMMV.

Dylan Beattie answered Oct 19 '22 04:10


I had some problems with this myself whilst using a list of strings originally built in Word. I have found that a simple String.Replace(oldValue, newValue) call works perfectly. The exact code I used for smart quotes (left ", right ", left ' and right ') is as follows:

StringName = StringName.Replace(ChrW(8216), "'")     ' Replaces any left ' with a normal '
StringName = StringName.Replace(ChrW(8217), "'")     ' Replaces any right ' with a normal '
StringName = StringName.Replace(ChrW(8220), """")    ' Replaces any left " with a normal "
StringName = StringName.Replace(ChrW(8221), """")    ' Replaces any right " with a normal "
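For anyone working in C# rather than VB.NET, the same four replacements might look like the sketch below. The class and method names are made up for this example; the code points are the ones used in the ChrW calls above.

```csharp
public static class SmartQuotes
{
    // C# rendering of the four VB.NET Replace calls; each escape is the
    // hexadecimal form of the decimal code passed to ChrW.
    public static string Straighten(string text)
    {
        return text
            .Replace('\u2018', '\'')  // left single quote  (ChrW(8216))
            .Replace('\u2019', '\'')  // right single quote (ChrW(8217))
            .Replace('\u201C', '"')   // left double quote  (ChrW(8220))
            .Replace('\u201D', '"');  // right double quote (ChrW(8221))
    }
}
```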

I hope this helps anyone out there still having this problem!

Paul answered Oct 19 '22 04:10