
Prefer charset declaration in HTML meta tag or HTTP header?

I'm parsing a lot of sites. Everything works fine, and I also read the charset declarations in order to convert encodings. Now I have a problem with http://celleheute.de/sonntagsfuhrung-3/.

The HTML meta tag says that the content is encoded as ISO-8859-2, but the HTTP header says it's UTF-8. The content really is UTF-8 encoded, so when my parser tries to convert it based on the ISO-8859-2 declaration, some characters break.
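To illustrate the kind of breakage, here is a minimal sketch in Python. It is not my actual parser, and the sample word is an assumption (the URL slug with its umlaut restored), not text copied from the page.

```python
# Hypothetical sample text; the real page content is not reproduced here.
text = "Sonntagsführung"

# The server actually sends UTF-8 bytes.
raw = text.encode("utf-8")

# Trusting the meta tag means decoding those bytes as ISO-8859-2:
# the two UTF-8 bytes of "ü" turn into two unrelated characters.
wrong = raw.decode("iso-8859-2")

# Trusting the HTTP header means decoding as UTF-8, which round-trips.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

Decoding with the wrong charset does not raise an error here, because every byte value used is a valid ISO-8859-2 character, so the damage is silent.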

Now my question is: which declaration should I prefer? Should I ignore the meta tag when I can find a declaration in the HTTP header, or vice versa? What do most web browsers do?

rabudde asked Aug 18 '11 05:08



1 Answer

To understand what modern browsers do, you should start reading at http://w3c.github.io/html/syntax.html#determining-the-character-encoding

Steps one and two are the most relevant to the question. They say:

  1. If the user has explicitly instructed the user agent to override the document's character encoding with a specific encoding, optionally return that encoding with the confidence certain and abort these steps.

  2. If the transport layer specifies an encoding, and it is supported, return that encoding with the confidence certain, and abort these steps.

which means that the real HTTP header takes precedence over everything except a user override.

Beyond that it can get complex. A byte order mark can, for example, take precedence over the meta tag.


UPDATE: Since this answer was written, the spec changed (around mid-2012) so that the byte order mark now takes precedence over the HTTP header.
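For a parser, the resulting order (byte order mark, then HTTP header, then meta tag) can be sketched roughly as follows. This is a simplified illustration, not the spec's full algorithm: the function names, the regex-based meta scan, the 1024-byte scan window, and the `windows-1252` fallback are all assumptions made for the example, and only UTF-8 and UTF-16 BOMs are checked.

```python
import codecs
import re
from email.message import Message

# BOMs checked in this sketch (UTF-8 and UTF-16 only).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Crude stand-in for the spec's meta prescan: grabs the first charset=...
# value that appears inside a <meta ...> tag.
_META_RE = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def _known(label):
    """Return True if Python has a codec for this charset label."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def sniff_encoding(body, content_type=None, default="windows-1252"):
    """Return (encoding, source) using the order BOM > HTTP header > meta."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name, "bom"
    header = charset_from_header(content_type)
    if header and _known(header):
        return header, "http"
    match = _META_RE.search(body[:1024])  # assumed scan window
    if match:
        label = match.group(1).decode("ascii")
        if _known(label):
            return label, "meta"
    return default, "default"


# The situation from the question: meta says ISO-8859-2, header says UTF-8.
page = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">'
print(sniff_encoding(page, "text/html; charset=UTF-8"))
```

With this ordering the conflicting page from the question is decoded as UTF-8, because the header is consulted before the meta tag is ever scanned.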

Alohci answered Nov 14 '22 22:11