Encoding scheme used for cookies

RFC 6265, Section 6.1, says user agents should allow at least 4096 bytes per cookie (measured as the sum of the lengths of the cookie's name, value, and attributes).

Now in order to know the number of characters allowed per cookie, I need to know the character encoding being used for cookies, as the RFC specifies the maximum size per cookie in terms of bytes and not characters.

How do I know the encoding being used to store cookies?

Is it determined by the character encoding of the programming language used to create the cookies (e.g., PHP, JavaScript), or by the character encoding used by the browser that stores them?

Update:

I ran a few tests, and Firefox, Chrome, and Opera appear to use UTF-8 for cookie storage. The encoding therefore determines the maximum number of characters that fit in a cookie on a given client.

Suspecting the browsers are using UTF-8 as the character encoding for cookies, I used the tests here with a single-byte UTF-8 character (1), a two-byte UTF-8 character (£), a 3-byte UTF-8 character (畀), and a 4-byte UTF-8 character (𝆏). I've pasted the results obtained below.

Every cookie set used a single-byte cookie name. The character counts below do not include that single-byte name or the = character that separates the cookie name from the cookie value. The value in [] beside each Unicode character denotes its hex representation in UTF-8.
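As a quick cross-check of the byte sequences quoted in the results, here is a minimal Python sketch that encodes the four test characters as UTF-8. The 3-byte character is written as an escape reconstructed from the hex bytes E7 95 80 listed in the results; the helper name `utf8_hex` is made up for illustration.

```python
# The four test characters, written as escapes so the file encoding does not matter.
# "\u7540" is reconstructed from the hex bytes E7 95 80 given in the results.
TEST_CHARS = ["1", "\u00a3", "\u7540", "\U0001d18f"]


def utf8_hex(ch: str) -> list[str]:
    """Return the UTF-8 bytes of ch as hex strings such as '0xC2'."""
    return [f"0x{b:02X}" for b in ch.encode("utf-8")]


for ch in TEST_CHARS:
    encoded = utf8_hex(ch)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) {encoded}")
```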

FF 31.0

Firefox exceeds the RFC minimum by one byte, allowing 4097 bytes per cookie.

  1. 1-byte character (1, [0x31]) -- 4095 characters
  2. 2-byte character (£, [0xC2, 0xA3]) -- 2047 characters
  3. 3-byte character (畀, [0xE7, 0x95, 0x80]) -- 1365 characters
  4. 4-byte character (𝆏, [0xF0, 0x9D, 0x86, 0x8F]) -- 1023 characters

Chrome 36.0.1985.143

  1. 1-byte character (1, [0x31]) -- 4094 characters
  2. 2-byte character (£, [0xC2, 0xA3]) -- 2047 characters
  3. 3-byte character (畀, [0xE7, 0x95, 0x80]) -- 1364 characters
  4. 4-byte character (𝆏, [0xF0, 0x9D, 0x86, 0x8F]) -- 1023 characters

Opera 24.0.1558.17

  1. 1-byte character (1, [0x31]) -- 4094 characters
  2. 2-byte character (£, [0xC2, 0xA3]) -- 2047 characters
  3. 3-byte character (畀, [0xE7, 0x95, 0x80]) -- 1364 characters
  4. 4-byte character (𝆏, [0xF0, 0x9D, 0x86, 0x8F]) -- 1023 characters
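The Firefox, Chrome, and Opera counts above are consistent with a simple byte budget: take the per-cookie limit, subtract two bytes for the single-byte name and the = sign, and divide by the UTF-8 width of the character. The sketch below reproduces that arithmetic. The 4096-byte figure for Chrome and Opera is an assumption inferred from the results (it is the RFC minimum), and `max_value_chars` is an illustrative name, not a browser API.

```python
def max_value_chars(limit_bytes: int, bytes_per_char: int, overhead_bytes: int = 2) -> int:
    """Characters that fit in the value once the name and '=' are accounted for."""
    return (limit_bytes - overhead_bytes) // bytes_per_char


# Per-cookie limits in bytes; 4097 is stated for Firefox, 4096 is assumed for the others.
LIMITS = {"Firefox 31.0": 4097, "Chrome 36 / Opera 24": 4096}

for browser, limit in LIMITS.items():
    counts = [max_value_chars(limit, width) for width in (1, 2, 3, 4)]
    print(browser, counts)
# Firefox 31.0 [4095, 2047, 1365, 1023]
# Chrome 36 / Opera 24 [4094, 2047, 1364, 1023]
```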

IE 8.0.6001.19518

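IE also exceeds the RFC minimum, with a limit of 5117 per cookie that appears to be counted in characters rather than bytes (see the note below). It also enforces a limit on the total size of all cookies per domain (in this case, the limit found was 10234 characters).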

  1. 1-byte character (1, [0x31]) -- 5115 characters
  2. 2-byte character (£, [0xC2, 0xA3]) -- 5115 characters
  3. 3-byte character (畀, [0xE7, 0x95, 0x80]) -- 5115 characters
  4. 4-byte character (𝆏, [0xF0, 0x9D, 0x86, 0x8F]) -- 2557 characters

Note on IE:

IE seems to be using ECMAScript's notion of characters. ECMAScript exposes characters as 16-bit unsigned integers (the character encoding could be either UTF-16 or UCS-2 and is left as an implementation choice). The 4-byte character chosen for the tests takes two 16-bit code units in UTF-16. Since ECMAScript counts each 16-bit integer as a character, "𝆏".length === 2 returns true, so 𝆏 is counted as two characters.
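The IE numbers fit the same budget arithmetic if the unit is a 16-bit code unit instead of a byte. The following Python sketch counts UTF-16 code units (what ECMAScript's `.length` reports) and applies the 5117 limit found in the test. The function names are illustrative, and treating the limit as code units is the hypothesis from the note above, not documented IE behavior.

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units in s, matching ECMAScript's String.length."""
    return len(s.encode("utf-16-le")) // 2


IE_LIMIT_UNITS = 5117  # limit observed in the test, assumed to be in code units
OVERHEAD_UNITS = 2     # single-character cookie name plus '='


def max_value_chars_ie(ch: str) -> int:
    """How many copies of ch fit in the value under the assumed code-unit limit."""
    return (IE_LIMIT_UNITS - OVERHEAD_UNITS) // utf16_code_units(ch)


for ch in ["1", "\u00a3", "\u7540", "\U0001d18f"]:
    print(f"U+{ord(ch):04X}: {utf16_code_units(ch)} unit(s), {max_value_chars_ie(ch)} characters")
```

Under this assumption the three BMP characters each allow 5115 characters and the supplementary-plane character allows 2557, matching the list above.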

Bharat Khatri asked Sep 04 '14 12:09


2 Answers

It seems to be determined more by the browser's implementers than by the programming language. Cookie values are usually URL-encoded, but this is not required.
Have a look at this answer, which complements your study (it adds the Safari special case). This one might help too.

n0p answered Oct 14 '22 16:10


No matter how the cookies are stored internally by the browser, they eventually have to be transferred within the Set-Cookie and Cookie HTTP header fields. It is the encoded length of those fields that the authors of the RFC most probably had in mind. At least in most RFCs that would be the case, so why not assume it here. Consequently, "the size of a cookie" depends on the way it will be encoded within an HTTP header.

According to the standard, request header fields should be

the OCTETs making up the field-value and consisting of either *TEXT or combinations of token, separators, and quoted-string

where *TEXT, in turn:

MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047.

RFC 2047 defines what is known as "MIME encoding" and, as I read it, has some funny rules. Namely, in order to encode a foreign charset you have to use either a "quoted-printable" format: =?UTF-8?Q?=48=65=6c=6c=6f?=, or a "Base64" format: =?UTF-8?B?SGVsbG8=?=. (Note that both examples here encode the word "Hello". The first uses 27 bytes and the second uses 20; this does not include the cookie name and attributes.)
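To make the two byte counts concrete, here is a small Python sketch that builds both encoded-word forms for "Hello" and measures them. It hex-escapes every byte in the Q form to match the example above (RFC 2047 also allows many ASCII characters to stay literal), and the function names are made up for illustration.

```python
import base64


def q_encoded_word(text: str, charset: str = "UTF-8") -> str:
    """Build a Q-form encoded word, hex-escaping every byte as in the example above."""
    payload = "".join(f"={b:02x}" for b in text.encode(charset))
    return f"=?{charset}?Q?{payload}?="


def b_encoded_word(text: str, charset: str = "UTF-8") -> str:
    """Build a B-form (Base64) encoded word."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


for word in (q_encoded_word("Hello"), b_encoded_word("Hello")):
    print(len(word), word)
# 27 =?UTF-8?Q?=48=65=6c=6c=6f?=
# 20 =?UTF-8?B?SGVsbG8=?=
```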

Moreover, according to RFC 2047, an "encoded word" may not be longer than 75 characters (and a header line containing one is limited to 76 characters). Hence, if I understand things correctly, longer cookie values would have to be encoded as a series of such pieces, each one starting with the =?UTF-8?Q? mumbo-jumbo.

I tested what would happen if I set a non-ASCII (Russian language) cookie using PHP via Apache. The resulting Set-Cookie header had no charset specification, used URL-encoding and was longer than 76 bytes (so much for the standards, right?):

CookieName=%D0%92+%D0...%B0%D0%B9; expires=Thu, 11-Sep-2014 19:59:18 GMT; path=/tmp/; domain=.some.domain.

The total length of a cookie value (with attributes), corresponding to an otherwise 176-character sentence was 923 bytes.
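The growth comes from percent-encoding: each Cyrillic letter is two bytes in UTF-8, and each byte becomes a three-character %XX escape, so one letter costs six bytes on the wire while a space becomes a single +. The sketch below illustrates this with Python's `urllib.parse.quote_plus` as a stand-in for the URL-encoding seen in the header. The sample string is a hypothetical two-letter fragment, not the sentence used in the test.

```python
from urllib.parse import quote_plus


def url_encoded_length(value: str) -> int:
    """Length in bytes of value after URL-encoding its UTF-8 representation."""
    return len(quote_plus(value, encoding="utf-8"))


# Hypothetical fragment: two Cyrillic letters (U+0412, U+043A) separated by a space.
sample = "\u0412 \u043a"

print(quote_plus(sample))          # %D0%92+%D0%BA
print(len(sample), "characters ->", url_encoded_length(sample), "bytes")
```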

To summarize, I don't think you can get a strict answer to your question, but it's a fun question nonetheless.

KT. answered Oct 14 '22 16:10