
How do browsers determine the encoding used?

Tags: html, encoding

I understand there are two ways to set the encoding:

  1. By using the Content-Type header.
  2. By using meta tags in HTML.

The Content-Type header is not mandatory and has to be set explicitly (the server side can set it if it wants), and the meta tag is also optional.

If neither of these is present, how does the browser determine the encoding to use when parsing the content?

Vivek Kumar asked Mar 31 '17 19:03



1 Answer

They can guess it based on heuristics.

I don't know how good browsers are at encoding detection today, but MS Word did a very good job at it and recognized even charsets I had never heard of. You can just open a *.txt file with an arbitrary encoding and see.

This algorithm usually involves statistical analysis of byte patterns, like frequency distribution of trigraphs of various languages encoded in each code page that will be detected; such statistical analysis can also be used to perform language detection.
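As a toy illustration of that statistical idea (not how any browser's detector actually works), the sketch below decodes the bytes with each candidate encoding and keeps the one whose result looks most like the expected language. The `EXPECTED_CHARS` table, the candidate list and the function names are assumptions made up for this example; real detectors use far richer frequency models.

```python
# Hypothetical table: characters we expect to be frequent in German text.
EXPECTED_CHARS = set("abcdefghijklmnopqrstuvwxyzäöüß .,!?")


def score(text: str) -> float:
    """Fraction of characters that belong to the expected set."""
    lowered = text.lower()
    if not lowered:
        return 0.0
    hits = sum(1 for ch in lowered if ch in EXPECTED_CHARS)
    return hits / len(lowered)


def guess_encoding(data: bytes, candidates=("utf-8", "windows-1252")):
    """Return the candidate whose decoded text scores highest, or None."""
    best_name, best_score = None, -1.0
    for name in candidates:
        try:
            text = data.decode(name)  # strict: invalid byte sequences raise
        except UnicodeDecodeError:
            continue  # this candidate cannot be the right one
        current = score(text)
        if current > best_score:
            best_name, best_score = name, current
    return best_name


sample = "Grüße aus München"
print(guess_encoding(sample.encode("utf-8")))
print(guess_encoding(sample.encode("windows-1252")))
```

Decoding the UTF-8 bytes as windows-1252 succeeds but yields mojibake that scores poorly, while decoding the windows-1252 bytes as UTF-8 fails outright, so both samples end up attributed to the encoding they were written in. With short or ambiguous input the scores get close, which is exactly where guessing goes wrong.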

https://en.wikipedia.org/wiki/Charset_detection

Firefox uses the Mozilla Charset Detectors. The way it works is explained here, and you can also change its heuristic preferences. The Mozilla Charset Detectors were even forked into uchardet, which works better and detects more languages.

[Update: As commented below, it moved to chardetng since Firefox 73]

Chrome previously used the ICU detector but has since switched to CED (Compact Encoding Detection).


None of the detection algorithms are perfect. They can guess incorrectly, like this, because it's just guessing anyway!

This process is not foolproof because it depends on statistical data.

That's how the famous "Bush hid the facts" bug occurred. Bad guessing also introduces a vulnerability into the system.
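The ambiguity behind "Bush hid the facts" can be reproduced in a few lines. This is only a sketch of why the bytes are ambiguous, not of the detector that misjudged them: the 18 ASCII bytes of the sentence are also a perfectly valid sequence of nine UTF-16LE code units, and every one of them lands in the CJK ideograph range. The helper name `looks_cjk` is made up for this example.

```python
data = b"Bush hid the facts"  # 18 bytes of plain ASCII

as_ascii = data.decode("ascii")
as_utf16 = data.decode("utf-16-le")  # every byte pair becomes one code unit


def looks_cjk(text: str) -> bool:
    """True if every character is in the CJK Unified Ideographs block."""
    return all(0x4E00 <= ord(ch) <= 0x9FFF for ch in text)


print(as_ascii)
print([hex(ord(ch)) for ch in as_utf16])
print(len(as_utf16), looks_cjk(as_utf16))
```

Both decodings are valid, so a detector with no declared encoding has to pick one based on plausibility alone, and a short sample gives it very little to go on.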

For all those skeptics out there, there is a very good reason why the character encoding should be explicitly stated. When the browser isn't told what the character encoding of a text is, it has to guess: and sometimes the guess is wrong. Hackers can manipulate this guess in order to slip XSS past filters and then fool the browser into executing it as active code. A great example of this is the Google UTF-7 exploit.

http://htmlpurifier.org/docs/enduser-utf8.html#fixcharset-none

As a result, the encoding should always be explicitly stated.
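A minimal sketch of stating the encoding explicitly in both places the question mentions, the Content-Type header and a meta tag, might look like this. The function name `build_html_response` and the page layout are assumptions for illustration; any server framework has its own way of setting headers.

```python
import html


def build_html_response(title: str, text: str, encoding: str = "utf-8"):
    """Return (headers, body) with the encoding declared twice:
    once in the Content-Type header and once in a <meta charset> tag."""
    page = (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="{encoding}">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body><p>{html.escape(text)}</p></body></html>\n"
    )
    body = page.encode(encoding)  # the bytes must really be in that encoding
    headers = {
        "Content-Type": f"text/html; charset={encoding}",
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_html_response("Demo", "Grüße aus München")
print(headers["Content-Type"])
```

The declaration only helps if it is true, so the same `encoding` value is used both for the labels and for actually encoding the bytes.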

phuclv answered Sep 28 '22 23:09