Some of our users use e-mail clients that can't cope with Unicode, even when the encoding, etc. are properly set in the mail headers. I'd like to 'normalise' the content they're receiving. The biggest problem we have is users copy'n'pasting content from Microsoft Word into our web application, which then forwards that content by e-mail - including fractions, smart quotes, and all the other extended Unicode characters that Word helpfully inserts for you. I'm guessing there is no definitely solution for this, but before I sit down and start writing great big lookup tables, is there some built-in method that'll get me started? There's basically three phases involved. First, stripping accents from otherwise-normal letters - solution to this is here <pre class="prettyprint"><code>This paragraph contains “smart quotes” and áccénts and ½ of the problem is fractions </code></pre> goes to <pre class="prettyprint"><code>This paragraph contains “smart quotes” and accents and ½ of the problem is fractions </code></pre> Second, replacing single Unicode characters with their ASCII equivalent, to give: <pre class="prettyprint"><code>This paragraph contains "smart quotes" and accents and ½ of the problem is fractions </code></pre> This is the part where I'm hoping there's a solution before I implement my own. Finally, replacing specific characters with a suitable ASCII sequence - ½ to 1/2, and so on - which I'm pretty sure isn't natively supported by any kind of Unicode magic, but somebody might have written a suitable lookup table I can re-use. Any ideas?

I had some problems with this myself, whilst using a list of strings originally built in Word. I have found that using a simple <code>"String".replace(current char/string, new char/string)</code> command works perfectly. The exact code I used was for smart quotes, or to be exact: left ", right ", left ', and right ' is as follows: <pre class="prettyprint"><code>StringName = StringName.Replace(ChrW(8216), "'") ' Replaces any left ' with a normal ' StringName = StringName.Replace(ChrW(8217), "'") ' Replaces any right ' with a normal ' StringName = StringName.Replace(ChrW(8220), """") ' Replace any left " with a normal " StringName = StringName.Replace(ChrW(8221), """") ' Replace any right " with a normal " </code></pre> I hope this helps anyone out there still having this problem!

Can .NET convert Unicode to ASCII to remove "smart quotes", etc?

Tags:

.net

unicode

ascii

normalize

codepages

Some of our users use e-mail clients that can't cope with Unicode, even when the encoding, etc. are properly set in the mail headers.

I'd like to 'normalise' the content they're receiving. The biggest problem we have is users copy'n'pasting content from Microsoft Word into our web application, which then forwards that content by e-mail - including fractions, smart quotes, and all the other extended Unicode characters that Word helpfully inserts for you.

I'm guessing there is no definitely solution for this, but before I sit down and start writing great big lookup tables, is there some built-in method that'll get me started?

There's basically three phases involved.

First, stripping accents from otherwise-normal letters - solution to this is here

This paragraph contains “smart quotes” and áccénts and ½ of the problem is fractions

goes to

This paragraph contains “smart quotes” and accents and ½ of the problem is fractions

Second, replacing single Unicode characters with their ASCII equivalent, to give:

This paragraph contains "smart quotes" and accents and ½ of the problem is fractions

This is the part where I'm hoping there's a solution before I implement my own. Finally, replacing specific characters with a suitable ASCII sequence - ½ to 1/2, and so on - which I'm pretty sure isn't natively supported by any kind of Unicode magic, but somebody might have written a suitable lookup table I can re-use.

Any ideas?

620

asked May 28 '11 18:05

Dylan Beattie

2 Answers

Thank you all for some very useful answers. I realize the actual question isn't "How can I convert ANY Unicode character into its ASCII fallback" - the question is "how can I convert the Unicode characters my customers are complaining about into their ASCII fallbacks" ?

In other words - we don't need a general-purpose solution; we need a solution that'll work 99% of the time, for English-speaking customers pasting English-language content from Word and other websites into our application. To that end, I analyzed eight years' worth of messages sent through our system looking for characters that aren't representable in ASCII encoding, using this test:

///<summary>Determine whether the supplied character is 
///using ASCII encoding.</summary>
bool IsAscii(char inputChar) {
    var ascii = new ASCIIEncoding();
    var asciiChar = (char)(ascii.GetBytes(inputChar.ToString())[0]);
    return(asciiChar == inputChar);
}

I've then been through the resulting set of unrepresentable characters and manually assigned an appropriate replacement string. The whole lot is bundled up in an extension method, so you can call myString.Asciify() to convert your string into a reasonable ASCII-encoding approximation.

public static class StringExtensions {
    private static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>();
    /// <summary>Returns the specified string with characters not representable in ASCII codepage 437 converted to a suitable representative equivalent.  Yes, this is lossy.</summary>
    /// <param name="s">A string.</param>
    /// <returns>The supplied string, with smart quotes, fractions, accents and punctuation marks 'normalized' to ASCII equivalents.</returns>
    /// <remarks>This method is lossy. It's a bit of a hack that we use to get clean ASCII text for sending to downlevel e-mail clients.</remarks>
    public static string Asciify(this string s) {
        return (String.Join(String.Empty, s.Select(c => Asciify(c)).ToArray()));
    }

    private static string Asciify(char x) {
        return Replacements.ContainsKey(x) ? (Replacements[x]) : (x.ToString());
    }

    static StringExtensions() {
        Replacements['’'] = "'"; // 75151 occurrences
        Replacements['–'] = "-"; // 23018 occurrences
        Replacements['‘'] = "'"; // 9783 occurrences
        Replacements['”'] = "\""; // 6938 occurrences
        Replacements['“'] = "\""; // 6165 occurrences
        Replacements['…'] = "..."; // 5547 occurrences
        Replacements['£'] = "GBP"; // 3993 occurrences
        Replacements['•'] = "*"; // 2371 occurrences
        Replacements[' '] = " "; // 1529 occurrences
        Replacements['é'] = "e"; // 878 occurrences
        Replacements['ï'] = "i"; // 328 occurrences
        Replacements['´'] = "'"; // 226 occurrences
        Replacements['—'] = "-"; // 133 occurrences
        Replacements['·'] = "*"; // 132 occurrences
        Replacements['„'] = "\""; // 102 occurrences
        Replacements['€'] = "EUR"; // 95 occurrences
        Replacements['®'] = "(R)"; // 91 occurrences
        Replacements['¹'] = "(1)"; // 80 occurrences
        Replacements['«'] = "\""; // 79 occurrences
        Replacements['è'] = "e"; // 79 occurrences
        Replacements['á'] = "a"; // 55 occurrences
        Replacements['™'] = "TM"; // 54 occurrences
        Replacements['»'] = "\""; // 52 occurrences
        Replacements['ç'] = "c"; // 52 occurrences
        Replacements['½'] = "1/2"; // 48 occurrences
        Replacements[''] = "-"; // 39 occurrences
        Replacements['°'] = " degrees "; // 33 occurrences
        Replacements['ä'] = "a"; // 33 occurrences
        Replacements['É'] = "E"; // 31 occurrences
        Replacements['‚'] = ","; // 31 occurrences
        Replacements['ü'] = "u"; // 30 occurrences
        Replacements['í'] = "i"; // 28 occurrences
        Replacements['ë'] = "e"; // 26 occurrences
        Replacements['ö'] = "o"; // 19 occurrences
        Replacements['à'] = "a"; // 19 occurrences
        Replacements['¬'] = " "; // 17 occurrences
        Replacements['ó'] = "o"; // 15 occurrences
        Replacements['â'] = "a"; // 13 occurrences
        Replacements['ñ'] = "n"; // 13 occurrences
        Replacements['ô'] = "o"; // 10 occurrences
        Replacements['¨'] = ""; // 10 occurrences
        Replacements['å'] = "a"; // 8 occurrences
        Replacements['ã'] = "a"; // 8 occurrences
        Replacements['ˆ'] = ""; // 8 occurrences
        Replacements['©'] = "(c)"; // 6 occurrences
        Replacements['Ä'] = "A"; // 6 occurrences
        Replacements['Ï'] = "I"; // 5 occurrences
        Replacements['ò'] = "o"; // 5 occurrences
        Replacements['ê'] = "e"; // 5 occurrences
        Replacements['î'] = "i"; // 5 occurrences
        Replacements['Ü'] = "U"; // 5 occurrences
        Replacements['Á'] = "A"; // 5 occurrences
        Replacements['ß'] = "ss"; // 4 occurrences
        Replacements['¾'] = "3/4"; // 4 occurrences
        Replacements['È'] = "E"; // 4 occurrences
        Replacements['¼'] = "1/4"; // 3 occurrences
        Replacements['†'] = "+"; // 3 occurrences
        Replacements['³'] = "'"; // 3 occurrences
        Replacements['²'] = "'"; // 3 occurrences
        Replacements['Ø'] = "O"; // 2 occurrences
        Replacements['¸'] = ","; // 2 occurrences
        Replacements['Ë'] = "E"; // 2 occurrences
        Replacements['ú'] = "u"; // 2 occurrences
        Replacements['Ö'] = "O"; // 2 occurrences
        Replacements['û'] = "u"; // 2 occurrences
        Replacements['Ú'] = "U"; // 2 occurrences
        Replacements['Œ'] = "Oe"; // 2 occurrences
        Replacements['º'] = "?"; // 1 occurrences
        Replacements['‰'] = "0/00"; // 1 occurrences
        Replacements['Å'] = "A"; // 1 occurrences
        Replacements['ø'] = "o"; // 1 occurrences
        Replacements['˜'] = "~"; // 1 occurrences
        Replacements['æ'] = "ae"; // 1 occurrences
        Replacements['ù'] = "u"; // 1 occurrences
        Replacements['‹'] = "<"; // 1 occurrences
        Replacements['±'] = "+/-"; // 1 occurrences
    }
}

Note that there are some rather odd fallbacks in there - like this one:

Replacements['³'] = "'"; // 3 occurrences
Replacements['²'] = "'"; // 3 occurrences

That's because one of our users has some program that converts open/close smart-quotes into ² and ³ (like : he said ²hello³) and nobody has ever used them to represent exponentiation, so this will probably work quite nicely for us, but YMMV.

163

answered Oct 19 '22 04:10

Dylan Beattie

I had some problems with this myself, whilst using a list of strings originally built in Word. I have found that using a simple "String".replace(current char/string, new char/string) command works perfectly. The exact code I used was for smart quotes, or to be exact: left ", right ", left ', and right ' is as follows:

StringName = StringName.Replace(ChrW(8216), "'")     ' Replaces any left ' with a normal '
StringName = StringName.Replace(ChrW(8217), "'")     ' Replaces any right ' with a normal '
StringName = StringName.Replace(ChrW(8220), """")    ' Replace any left " with a normal "
StringName = StringName.Replace(ChrW(8221), """")    ' Replace any right " with a normal "

I hope this helps anyone out there still having this problem!

answered Oct 19 '22 04:10

Paul

Related questions
                            
                                Is there a .NET ready made method to process response body of a HttpListener HttpListenerRequest body?
                            
                                Difference between [...]Async and Begin[...] .net asynchronous APIs
                            
                                What is concrete implementation?
                            
                                Using a lambda expression versus a private method
                            
                                How do I get previous key from SortedDictionary?
                            
                                Call Python from .NET
                            
                                FindByIdentity - performance differences
                            
                                Get dependent assemblies?
                            
                                DateTime - Strange daylight savings behaviour
                            
                                Why does iterating over GetConsumingEnumerable() not fully empty the underlying blocking collection
                            
                                Potential pitfalls with static constructors in C#
                            
                                IEqualityComparer and Contains method
                            
                                Where is GAC (Global Assembly Cache) located? How it is Useful? [duplicate]
                            
                                AppendChild() is not a function javascript
                            
                                Debugging .NET code with MS's Symbol Server - VS does not show variable's values
                            
                                What is strong naming and how do I strong name a binary?
                            
                                Why use Microsoft AntiXSS library?
                            
                                ASP.NET MVC redirect from attribute
                            
                                AES Encryption and C#
                            
                                How to sort an UltraGrid by multiple columns programmatically?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Can .NET convert Unicode to ASCII to remove "smart quotes", etc?

Tags:

.net

unicode

ascii

normalize

codepages

Dylan Beattie

People also ask

2 Answers

Dylan Beattie

Paul

Recent Activity

Donate For Us