
Does HTML5 specify a default character encoding for HTML documents if no character encoding is supplied?

An example HTML document retrieved over HTTP lacks:

  • an HTTP Content-Type header
  • an HTML <meta charset="<character encoding>" />
  • an HTML <meta http-equiv='Content-Type' content='text/html; charset=<character encoding>'>

With regard to HTML5, is a default character encoding (for example, UTF-8) assumed? Or is it entirely up to the application reading the HTML document to choose a default?

Jon Cram asked Sep 13 '12 12:09




1 Answer

The character encoding is determined using these rules, in order of precedence:

  1. User override.
  2. An HTTP "charset" parameter in a "Content-Type" field.
  3. A Byte Order Mark before any other data in the HTML document itself.
  4. A META declaration with a "charset" attribute.
  5. A META declaration with an "http-equiv" attribute set to "Content-Type" and a value set for "charset".
  6. Unspecified heuristic analysis.
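To make the precedence concrete, here is a minimal Python sketch that walks the six steps in the order listed above. The function name `determine_encoding`, its parameters, the 1024-byte scan window, and the regular expressions are all assumptions made for illustration; a real browser tokenizes the start of the document rather than using regexes, and step 6 is left to the caller because the heuristics are unspecified.

```python
import re

# Step 3: byte order marks and the encodings they indicate.
_BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
)

# Simplified patterns (assumption: attributes appear in this order).
_HTTP_CHARSET = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.I)
_META_CHARSET = re.compile(
    rb"<meta\s+charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.I
)
_META_HTTP_EQUIV = re.compile(
    rb"<meta\s+http-equiv\s*=\s*[\"']?content-type[\"']?\s+"
    rb"content\s*=\s*[\"'][^\"'>]*?charset\s*=\s*([A-Za-z0-9._:-]+)",
    re.I,
)


def determine_encoding(raw, content_type=None, user_override=None):
    """Return an encoding label, or None if only heuristics remain."""
    # 1. User override.
    if user_override:
        return user_override.lower()
    # 2. "charset" parameter of the HTTP Content-Type field.
    if content_type:
        match = _HTTP_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 3. Byte order mark before any other data.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    head = raw[:1024]  # hypothetical scan window for the sketch
    # 4. <meta charset="...">
    match = _META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # 5. <meta http-equiv="Content-Type" content="...; charset=...">
    match = _META_HTTP_EQUIV.search(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # 6. Unspecified heuristic analysis: not attempted here.
    return None
```

For the document described in the question (no header, no META declaration, no BOM), this sketch returns `None`, which is the point at which the reading application has to guess.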

...and then...

  1. Normalize the given character encoding string according to the Charset Alias Matching rules defined in Unicode Technical Standard #22.
  2. Override some problematic encodings, i.e. intentionally treat some encodings as if they were different encodings. The most common override is treating US-ASCII and ISO-8859-1 as Windows-1252, but several other encoding overrides are listed in a table in the specification. As the specification notes, "The requirement to treat certain encodings as other encodings according to the table above is a willful violation of the W3C Character Model specification."
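The two follow-up steps can be sketched like this. The alias matching follows the UTS #22 rules as I understand them (drop everything except ASCII letters and digits, lowercase, then delete each 0 not preceded by a digit). The names `normalize_label` and `effective_encoding` are made up for this example, and the override table is deliberately incomplete: it contains only the two overrides named above, not the full table from the specification.

```python
import re


def normalize_label(label):
    """Reduce an encoding label to its UTS #22 alias-matching form."""
    # Keep only ASCII letters and digits, then lowercase.
    stripped = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    out = []
    for ch in stripped:
        # Left to right, delete each "0" that is not preceded by a digit.
        if ch == "0" and not (out and out[-1].isdigit()):
            continue
        out.append(ch)
    return "".join(out)


# Partial override table (assumption: only the two overrides mentioned
# in the answer; the specification lists more).
_OVERRIDES = {
    normalize_label("US-ASCII"): "windows-1252",
    normalize_label("ISO-8859-1"): "windows-1252",
}


def effective_encoding(label):
    """Return the encoding to decode with after applying overrides."""
    key = normalize_label(label)
    return _OVERRIDES.get(key, label.strip().lower())
```

With this sketch, `effective_encoding("ISO_8859-1")` and `effective_encoding("us-ascii")` both come back as `windows-1252`, while `effective_encoding("UTF-8")` is left alone.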

But the most important thing is:

You should always specify a character encoding on every HTML document, or bad things will happen. You can do it the hard way (HTTP Content-Type header), the easy way (<meta http-equiv> declaration), or the new way (<meta charset> attribute), but please do it. The web thanks you.
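As a rough illustration of declaring the encoding in more than one place at once, here is a small Python sketch that builds a response with both the HTTP header and the `<meta charset>` declaration. The helper name `render_page` and the choice of UTF-8 are assumptions for the example; the same idea applies to whatever server framework you actually use.

```python
import html


def render_page(title, body_html):
    """Build (headers, payload) with the encoding declared twice."""
    document = (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        # The "new way": <meta charset> inside the document.
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        f"<body>{body_html}</body>\n"
        "</html>\n"
    )
    # Encode with the same encoding that is being declared.
    payload = document.encode("utf-8")
    headers = {
        # The "hard way": charset parameter on the HTTP header.
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(payload)),
    }
    return headers, payload
```

The important part is that the bytes are actually encoded in the encoding you declare; a declaration that disagrees with the bytes is no better than none.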

Sources:

  • http://blog.whatwg.org/the-road-to-html-5-character-encoding
  • http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding
ThiefMaster answered Sep 28 '22 09:09