
uint and char casting for unicode character code

Tags: c#, .net, unicode

Can someone explain exactly what is going on with this code:

var letter = 'J';
char c = (char)(0x000000ff & (uint)letter);

I understand it is getting the Unicode representation of the character; however, I don't fully understand the role of:

(0x000000ff & (uint)letter)

What is the purpose of 0x000000ff and the casting of the letter to (uint), and is there a shorthand way of achieving the same result?

Thanks

Update

OK, it looks like most people think this is a bad example. I didn't want to include the whole class, but I suppose I might as well so you can see the context. From Reference Source's WebHeaderCollection:

    private static string CheckBadChars(string name, bool isHeaderValue)
    {
        if (name == null || name.Length == 0)
        {
            // empty name is invalid
            if (!isHeaderValue)
            {
                throw name == null ? 
                    new ArgumentNullException("name") :
                    new ArgumentException(SR.GetString(SR.WebHeaderEmptyStringCall, "name"), "name");
            }

            // empty value is OK
            return string.Empty;
        }

        if (isHeaderValue)
        {
            // VALUE check
            // Trim spaces from both ends
            name = name.Trim(HttpTrimCharacters);

            // First, check for correctly formed multi-line value
            // Second, check for absence of CTL characters
            int crlf = 0;
            for (int i = 0; i < name.Length; ++i)
            {
                char c = (char)(0x000000ff & (uint)name[i]);
                switch (crlf)
                {
                    case 0:
                        if (c == '\r')
                        {
                            crlf = 1;
                        }
                        else if (c == '\n')
                        {
                            // Technically this is bad HTTP.  But it would be a breaking change to throw here.
                            // Is there an exploit?
                            crlf = 2;
                        }
                        else if (c == 127 || (c < ' ' && c != '\t'))
                        {
                            throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidControlChars), "value");
                        }

                        break;

                    case 1:
                        if (c == '\n')
                        {
                            crlf = 2;
                            break;
                        }

                        throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidCRLFChars), "value");

                    case 2:
                        if (c == ' ' || c == '\t')
                        {
                            crlf = 0;
                            break;
                        }

                        throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidCRLFChars), "value");
                }
            }

            if (crlf != 0)
            {
                throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidCRLFChars), "value");
            }
        }
        else
        {
            // NAME check
            // First, check for absence of separators and spaces
            if (name.IndexOfAny(InvalidParamChars) != -1)
            {
                throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidHeaderChars), "name");
            }

            // Second, check for non CTL ASCII-7 characters (32-126)
            if (ContainsNonAsciiChars(name))
            {
                throw new ArgumentException(SR.GetString(SR.WebHeaderInvalidNonAsciiChars), "name");
            }
        }

        return name;
    }

The bit of interest is:

char c = (char)(0x000000ff & (uint)name[i]);

asked Feb 12 '26 by Imran Azad


2 Answers

You're parsing HTTP headers, right? That means you shouldn't be using (any) Unicode encoding.

HTTP headers must be 7-bit ASCII (unlike the request data) [1]. That means that you should be using the ASCII encoding instead of the default. So while you are parsing the request bytes, you have to use Encoding.ASCII.GetString instead of Encoding.Default.GetString. Hopefully, you're not using StreamReader - that would be a bad idea for quite a few reasons, including the (likely) encoding mismatch between the headers and the content of the request.
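
For example, a minimal sketch (the byte buffer here is made up, standing in for bytes read off the socket):

using System;
using System.Text;

class AsciiHeaderSketch
{
    static void Main()
    {
        // Hypothetical raw header bytes, standing in for data read from the socket.
        byte[] headerBytes = Encoding.ASCII.GetBytes("Host: example.com\r\nAccept: text/html\r\n");

        // Decode the headers as 7-bit ASCII ...
        string headers = Encoding.ASCII.GetString(headerBytes);

        // ... rather than Encoding.Default.GetString(headerBytes), whose result
        // depends on the machine's current ANSI code page.
        Console.WriteLine(headers);
    }
}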

EDIT:

As for the use in Microsoft source code - yeah, it happens. Don't try to copy those kinds of things over - it is a hack. Remember, you don't have the test suites and quality assurance Microsoft engineers have, so even if it does in fact work, you're better off not copying such hacks.

I assume that it's handled this way because of the use of string for something that should, in principle, be either an "ASCII string" or just a byte[] - since .NET only supports Unicode strings, this was seen as the lesser evil. Indeed, that's why the code explicitly checks that the string doesn't contain any non-ASCII characters - it's well aware that the headers must be ASCII, and it will fail explicitly if the string does contain any. It's just the usual tradeoff when writing high-performance frameworks for other people to build on.
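
For illustration, here is a rough sketch of the kind of non-ASCII check described above; the name HasNonAsciiChars is made up and is not the framework's actual ContainsNonAsciiChars implementation:

// Hypothetical stand-in for the framework's check; anything above 0x7F
// falls outside 7-bit ASCII.
private static bool HasNonAsciiChars(string value)
{
    foreach (char c in value)
    {
        if (c > 0x7F)
        {
            return true;
        }
    }

    return false;
}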

Footnotes:

  [1] Actually, the RFC (2616) specifies US-ASCII as the encoding, probably meaning ISO-8859-1. However, an RFC is not a binding standard (it's more like a hope of making order out of chaos :D), and there are plenty of HTTP/1.0 and HTTP/1.1 clients (and servers) around that do not in fact respect this. Like the .NET authors, I'd stick with 7-bit ASCII (encoded char-per-byte, of course, not real 7-bit).

answered Feb 14 '26 by Luaan


What is the purpose of 0x000000ff and the casting of the letter to (uint)

To get a character whose code is in the [0..255] range: a char takes 2 bytes in memory, and the mask keeps only the low byte.

e.g.:

var letter = (char)4200; // ၩ
char c = (char)(0x000000ff & (uint)letter); // h

// or
// char c = (char)(0x00ff & (ushort)letter);

// ushort (2-byte unsigned integer) is enough: uint is 4-byte unsigned integer
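
As for the shorthand part of the question: the explicit uint cast isn't required, because the & operator already promotes the char operand to int, so the following gives the same result:

var letter = (char)4200;

// char is promoted to int by &, so no explicit cast of letter is needed
char c = (char)(letter & 0xFF); // 'h', same as (char)(0x000000ff & (uint)letter)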

answered Feb 14 '26 by ASh


