
Is a unicode user agent legal inside an HTTP header?

An application I'm maintaining loads user agents extracted from web logs into a MySQL table column using the 'latin1' charset. Occasionally, it fails to load a user agent that looks like this:

Mozilla/5.0 (Iâ?; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML^C like Gecko) Version

I suspect it's choking on Iâ?. I'm trying to figure out whether this should be supported or whether it's corruption introduced by the upstream logging system. Is this a legal user agent in an HTTP header?

AndreiM asked Apr 30 '12 13:04



2 Answers

RFC 2616 (HTTP/1.1) says that message header contents must be "consisting of either *TEXT or combinations of token, separators, and quoted-string". If you look at the definitions of TEXT and the related rules, you will find that the legal characters are those whose byte values are neither in the range [0, 31] nor equal to 127. Characters such as â are therefore legal per the spec, as far as I can tell.
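As a rough illustration of the rule described above, the check can be sketched in Python. The function name `is_rfc2616_text` and the sample bytes are hypothetical; the sample assumes the â in the question arrived as the single latin1 byte 0xE2, which the question does not confirm. The sketch also lets the horizontal tab through, since TEXT admits linear whitespace.

```python
def is_rfc2616_text(value: bytes) -> bool:
    """Return True if no octet is a control character (0-31 or 127).

    Horizontal tab (9) is accepted because the TEXT rule in RFC 2616
    includes linear whitespace. Octets above 127 are accepted.
    """
    return all(b == 9 or (b > 31 and b != 127) for b in value)


# Hypothetical sample: the user agent prefix with a as byte 0xE2 (latin1 "â").
sample = b"Mozilla/5.0 (I\xe2?; CPU iPhone OS 5_0_1 like Mac OS X)"

print(is_rfc2616_text(sample))           # high octet is allowed
print(is_rfc2616_text(b"bad\x03value"))  # control octet is rejected
```

Under this reading, a value fails only if it carries a raw control octet, not because it contains a byte above 127.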

Jon answered Oct 13 '22 08:10


Technically, octets > 127 are allowed in comments. RFC 2616 makes them default to ISO-8859-1, but HTTPbis (the upcoming revision of RFC 2616) has removed that rule so that sometime in the distant future, we may be able to move to a sane encoding.

Recommendation: strip all octets > 127.
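A minimal sketch of that recommendation in Python, assuming the user agent is still available as raw bytes before it is loaded into the table. The function name `strip_high_octets` and the sample value are hypothetical.

```python
def strip_high_octets(raw: bytes) -> str:
    """Drop every octet above 127 and return the remaining ASCII text."""
    ascii_only = bytes(b for b in raw if b < 128)
    return ascii_only.decode("ascii")


# Hypothetical sample: assumes the stray character is the single byte 0xE2.
raw_agent = b"Mozilla/5.0 (I\xe2?; CPU iPhone OS 5_0_1 like Mac OS X)"
print(strip_high_octets(raw_agent))
```

The result is plain ASCII, so it no longer depends on how the column charset or the logging pipeline interprets high bytes. If the original characters matter, decoding the raw bytes with `raw.decode("latin-1")` instead would keep them under the ISO-8859-1 default mentioned above.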

Julian Reschke answered Oct 13 '22 09:10