manually converting between ASCII and .NET characters

Tags:

.net

character-encoding

asp.net

ascii

I am writing some code to scrub user input to my ASP.NET site. I need to remove all occurrences of ASCII characters 145, 146, 147 and 148, which occasionally get entered by my Mac users when they copy and paste content written in a word processor on their Macs.

My issue is with the following three strings, which I am led to believe should output the same text.

string test1 = Convert.ToChar(147).ToString();
string test2 = String.Format("'{0}'", Convert.ToChar(147));

char[] characters = System.Text.Encoding.ASCII.GetChars(new byte[] { 147 });
string test3 = new string(characters);

Yet when I set an ASP TextBox to equal the following

txtShowValues.Text = test1 + "*" + test2 + "*" + test3;

I get a blank value for test1, test2 works correctly, and test3 outputs as a '?'.

Can someone explain what is happening differently in each case? I am hoping this will help me understand how .NET handles ASCII values for characters over 128 so that I can write a good scrubbing script.

EDIT
The values I mentioned (145 - 148) are curly quotes: single left, single right, double left and double right.

By "works correctly" I mean it outputs a curly quote to my browser.

SECOND EDIT
The following code (mentioned in an answer) outputs the curly quotes as well. So maybe the problem was using ASCII in test 3.

char[] characters2 = System.Text.Encoding.Default.GetChars(new byte[] { 147 });
string test4 = new string(characters2);

THIRD EDIT
I found a Mac that I could borrow and was able to duplicate the problem. When I copy and paste text that has quote symbols in it from Word into my web app on the Mac, it pastes curly quotes (147 and 148). When I hit save, the curly quotes are saved to the database, so I will use the code you all helped me with to scrub that content.

FOURTH EDIT
Spent some time writing more sample code based on the responses here and noticed it has something to do with MultiLine TextBoxes in ASP.NET. There was good info here, so I decided to just start a new question: ASP.NET Multiline textbox allowing input above UTF-8

976

asked Feb 05 '10 19:02

Justin C

2 Answers

Character 147 is U+0093 SET TRANSMIT STATE. Like all the Unicode characters in the range 0-255, it is the same as the ISO-8859-1 character of the same number. ISO-8859-1 assigns 147 to this invisible control code.

What you are thinking of is not ‘ASCII’ or even ‘ISO-8859-1’, but Windows code page 1252. This is a non-standard encoding that is like 8859-1, but assigns the characters 128-159 to various typographical extensions such as smart quotes instead of the largely-useless control codes. In code page 1252, character 147 is “, aka U+201C LEFT DOUBLE QUOTATION MARK.

If you want to convert Windows code pages (often misleadingly known as ‘ANSI’) to Unicode characters you will need to specify the code page you want, for example:

System.Text.Encoding.GetEncoding(1252).GetChars(new byte[] { 147 })

System.Text.Encoding.Default will give you the default encoding on your server. For a server in the Western European locale, that'll be 1252. Elsewhere, it won't be. It's generally not a good idea to have a dependency on the locale's default code page in a server application.
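To make the difference concrete, here is a minimal sketch that decodes the same byte under ISO-8859-1 and under code page 1252. It assumes a modern .NET runtime (.NET Core 3.0 or later), where code page 1252 is only available after registering `CodePagesEncodingProvider`; on the classic .NET Framework that registration line should not be needed and can be dropped.

```csharp
using System;
using System.Text;

// Assumption: running on .NET Core / .NET 5+, where legacy code pages such as
// 1252 must be enabled explicitly before GetEncoding(1252) will succeed.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

byte[] input = { 147 };

// ISO-8859-1 maps every byte to the Unicode code point with the same number.
char latin1 = Encoding.GetEncoding("iso-8859-1").GetChars(input)[0];

// Code page 1252 maps bytes 128-159 to typographic characters instead.
char cp1252 = Encoding.GetEncoding(1252).GetChars(input)[0];

Console.WriteLine("ISO-8859-1: U+{0:X4}", (int)latin1); // ISO-8859-1: U+0093
Console.WriteLine("CP1252:     U+{0:X4}", (int)cp1252); // CP1252:     U+201C
```

Naming the code page explicitly keeps the result independent of whatever `Encoding.Default` happens to be on a given server.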

In any case, you should not be getting bytes like 147 representing a “ in the input to a web application. That will only happen if your page itself is in code page 1252 encoding (and just to confuse and mislead even more, when you say your page is in ISO-8859-1 format, browsers will silently use code page 1252 instead). Your page may also be in 1252 if you've failed to specify any encoding for it (the browser guesses; other locales will guess different code pages so it'll all be a big mess).

Make sure you use UTF-8 for all encodings in your web app, and mark your pages as such. Today, all web apps should be using UTF-8.
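Once the input arrives as proper Unicode, the curly quotes show up as U+2018, U+2019, U+201C and U+201D rather than as bytes 145-148, so a scrubber can work on characters directly. The following is only a sketch: the `Scrub` helper name is hypothetical, and replacing curly quotes with straight ones (rather than deleting them) is an assumed policy. It also maps the control characters U+0091 to U+0094, which is what those bytes turn into when code page 1252 text has been misread as ISO-8859-1.

```csharp
using System;
using System.Text;

// Hypothetical helper: replaces curly quotes with their straight equivalents.
static string Scrub(string input)
{
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        switch (c)
        {
            case '\u2018': // LEFT SINGLE QUOTATION MARK
            case '\u2019': // RIGHT SINGLE QUOTATION MARK
            case '\u0091': // byte 145 misread as ISO-8859-1
            case '\u0092': // byte 146 misread as ISO-8859-1
                sb.Append('\'');
                break;
            case '\u201C': // LEFT DOUBLE QUOTATION MARK
            case '\u201D': // RIGHT DOUBLE QUOTATION MARK
            case '\u0093': // byte 147 misread as ISO-8859-1
            case '\u0094': // byte 148 misread as ISO-8859-1
                sb.Append('"');
                break;
            default:
                sb.Append(c);
                break;
        }
    }
    return sb.ToString();
}

Console.WriteLine(Scrub("\u201CIt\u2019s fine\u201D")); // "It's fine"
```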

128

answered Sep 20 '22 12:09

bobince

.NET uses Unicode (strings are UTF-16 internally), which is the same as ASCII only for values below 128.

ASCII doesn't define values above 127.

I think you may be thinking of ANSI, which defines values above 127 as (mostly) language characters needed for most European languages, or of OEM (the original IBM PC character set), which defines characters above 127 as (mostly) symbols.

Each particular way of interpreting the characters above 127 is called a code page, or an encoding (hence System.Text.Encoding). So you could probably get test 3 working if you used a different encoding, perhaps System.Text.Encoding.Default.

Edit: Ok, now that we know that the encoding you want is ANSI, it's clearer what is happening.

The rule for character conversions is to replace anything that can't be represented in the target encoding with some other character, usually a box. But ASCII has no box character, so it uses a ? instead. This explains test 3.
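The substitution described above can be observed directly. This is a small sketch, not the asker's code: it decodes byte 147 with the built-in ASCII encoding, then repeats the decode with exception fallbacks (an optional stricter setup, chosen here for illustration) so that the invalid byte raises an error instead of silently becoming a question mark.

```csharp
using System;
using System.Text;

byte[] input = { 147 };

// The built-in ASCII encoding substitutes '?' for any byte above 127.
string lenient = Encoding.ASCII.GetString(input);
Console.WriteLine(lenient); // ?

// Requesting exception fallbacks makes the same decode fail loudly instead.
Encoding strictAscii = Encoding.GetEncoding(
    "us-ascii",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());

bool threw = false;
try
{
    strictAscii.GetString(input);
}
catch (DecoderFallbackException)
{
    threw = true;
}
Console.WriteLine(threw); // True
```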

test1 and test2 both use Convert.ToChar with an integer constant, which interprets the input as a Unicode code point, not an ANSI character, so no conversion is applied. Unicode character 147 is a non-printing character.
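A quick check, written as a minimal sketch, shows what `Convert.ToChar(147)` actually produces: the control character U+0093, not the curly quote U+201C.

```csharp
using System;

char c = Convert.ToChar(147);

Console.WriteLine("U+{0:X4}", (int)c); // U+0093
Console.WriteLine(char.IsControl(c));  // True
Console.WriteLine(c == '\u201C');      // False
```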

answered Sep 18 '22 12:09

John Knoeller
