
What is the 'lang' attribute of the <html> tag used for?

Tags:

html

In HTML, it's good to have a lang attribute in <html>, e.g. <html lang="en">.

How is this useful?

If it is meant for translation, it doesn't seem to work: even if the language is set to English, when all the text in the document is Chinese, Google Translate detects it as Chinese, not English (which means Google ignores the lang attribute).

Santosh Kumar asked Feb 01 '13 15:02


People also ask

What is the lang attribute used for in HTML?

The lang (or sometimes the xml:lang ) attribute specifies the natural language of the content of a web page. An attribute on the html tag sets the language for all the text on the page.

How add lang attribute in HTML?

Always add a lang attribute to the html tag to set the default language of your page. If this is XHTML 1.x or an HTML5 polyglot document served as XML, you should also use the xml:lang attribute (with the same value). If your page is only served as XML, just use the xml:lang attribute.

Do I need Lang in HTML?

Whether or not you use the HTTP header, you should always declare the language of the text in a page using a language attribute on the html tag.

What is the use of Lang and title attribute in HTML explain with example?

You should always include the lang attribute inside the <html> tag, to declare the language of the Web page. This is meant to assist search engines and browsers. Country codes can also be added to the language code in the lang attribute.


3 Answers

I am quoting this from W3C:

Declaring language in HTML

Always use a language attribute on the html tag to declare the default language of the text in the page. When the page contains content in another language, add a language attribute to an element surrounding that content.

Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together.

Use language tags from the IANA Language Subtag Registry.

Also a good read is Why use the language attribute?.
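As a minimal sketch of the advice quoted above (the page text and the choice of English and French are illustrative assumptions, not taken from the W3C article), a page can declare its default language on the html tag and override it on any element that wraps content in another language:

```html
<!DOCTYPE html>
<!-- Default language for the whole page: English (hypothetical example page) -->
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Language declaration sketch</title>
</head>
<body>
  <!-- A single foreign word is wrapped in an element carrying its own lang -->
  <p>The French word for bread is <span lang="fr">pain</span>.</p>
  <!-- A longer passage in another language gets its own lang attribute too -->
  <blockquote lang="fr">J'aime le pain.</blockquote>
</body>
</html>
```

For XHTML 1.x or polyglot documents, the quoted guidance would have the same value repeated in xml:lang on the same element, i.e. lang="en" xml:lang="en".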

NullPoiиteя answered Oct 17 '22 15:10


You asked "how is this useful".

"The <lang=> attribute can be used to declare the language of a Web page or a portion of a Web page. This is meant to assist search engine spiders, page formatting and screen reader technology"

Source: http://symbolcodes.tlt.psu.edu/web/tips/langtag.html (Wayback Machine link)

No mention of translation - but often a search engine spider will not want to parse through a document "in the wrong language" - its index file will grow (lots of new words), and the results will not be useful to the user (who cannot read the language, and who is using the wrong search terms).

The advent of smart translation technology (like Google's, referred to above) means that some search engines can see a page in one language, translate it, and figure out that someone searching for "cow" may be interested in a page that mentions "vache" and is marked with lang="fr".

Floris answered Oct 17 '22 16:10


The lang attribute is needed by screen readers to let them pronounce words correctly, and also (perhaps surprisingly) sometimes needed to allow text to be rendered correctly by the browser.

lang needed for speech synthesis

Some blind or visually impaired people use speech-synthesizing screen readers that speak the words on the screen. Since two words from different languages that are spelt identically may be pronounced differently, such speech synthesis cannot be done without knowing the language of the text. For instance, the word "pain" in English is pronounced completely differently to the word "pain" in French, so a screen reader that doesn't know whether it's reading English or French won't know how to pronounce "pain".

Using the lang attribute indicates to a screen reader what language some text is in and thus allows it to pronounce the word correctly.

I recorded a demonstration of this using Narrator, the built-in screen reader for Windows. (If you'd like to reproduce this, note that you'll need to have both the English and French voice packages installed via the Speech settings page in the Windows Settings app, and have English as your default voice.) The demo uses an HTML page with the following content:

<h5>No lang specified:</h5>
<p>J'aime le pain</p>

<h5>French:</h5>
<p lang="fr">J'aime le pain</p>

As you can hear in the recording I uploaded at https://www.youtube.com/watch?v=7J1I65sn1CQ, Microsoft George (the default English voice) butchers the pronunciation of the French phrase (pronouncing it "Jay aim le payne"), whereas Microsoft Hortense (the default French voice) pronounces it correctly.

lang needed for text rendering

Perhaps surprisingly, the benefits of the lang attribute are not limited to disabled people using speech-synthesizing assistive tech. Setting lang can also affect text rendering, since the correct way to render some text can be language-dependent.

There are a couple of different mechanisms by which the lang you set can affect how text gets rendered:

  • different fonts being selected based on the lang attribute, either:

    • based on the browser's default font selection rules, or
    • because you've explicitly set up language-specific fonts using :lang selectors in your CSS

    or

  • fonts having language-specific rules included in them, such as language-specific alternative glyphs or language-specific rules about which sequences of characters to substitute with a ligature
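The explicit :lang selector case from the list above might look roughly like the sketch below. The font names are assumptions chosen for illustration (both are commonly present on Windows); substitute whatever fonts you actually have available.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>:lang font selection sketch</title>
  <style>
    /* Matches elements whose language is Chinese, including zh-CN, zh-TW, etc. */
    :lang(zh) { font-family: "Microsoft YaHei", sans-serif; }
    /* Matches elements whose language is Japanese */
    :lang(ja) { font-family: "Yu Gothic", sans-serif; }
  </style>
</head>
<body>
  <!-- The same Han character, labelled with two different languages -->
  <p lang="zh">飴</p>
  <p lang="ja">飴</p>
</body>
</html>
```

Because :lang() matches on an element's effective language rather than on the attribute itself, descendants of an element marked lang="ja" pick up the Japanese font without needing their own lang attribute.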

Below are a couple of interesting examples I found of such language-specific rendering.

Language-dependent forms of Han characters

There exist many Han (Chinese) characters that have been adopted in other east-Asian languages, such as Japanese (where such characters are called "Kanji"). The proper way to draw these characters sometimes differs between Chinese and the other languages that have assimilated them, yet, due to Unicode's Han unification, there only exists a single Unicode code point to represent the character, rather than a distinct code point for each language-specific variant of it. Several examples are listed in the Examples of language-dependent glyphs section of the Wikipedia article linked above.

When rendering such a character, in order to know which glyph to display (for instance, whether to display the Japanese Kanji or the Chinese hanzi), the browser needs to know the language of the text in which the character appears.

To try to see your browser considering text's language in this way, save the following HTML to a file and open it in your browser:

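<!-- Declare UTF-8 so the character below is decoded correctly from the saved file -->
<meta charset="utf-8">
Chinese: <span lang="zh">飴</span>
<br>
Japanese: <span lang="ja">飴</span>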

Note that the same character, 飴, is used in both spans. But they display differently in the browser, at least in Chrome on my Windows PC:

Screenshot demonstrating the point above

As you can see, the Kanji rendered in the span marked as Japanese is different in several ways from the hanzi rendered in the span marked as Chinese. By inspecting each span in the Chrome dev tools and looking at the "Rendered Fonts" section, I can see that this is because Chrome has used different fonts for the two spans - namely Microsoft YaHei for the Chinese span and Yu Gothic for the Japanese one.

fi ligatures getting disabled for Turkish text

As described at https://en.wikipedia.org/wiki/Ligature_(writing)#Stylistic_ligatures, a stylistic ligature is used in many fonts that merges together the letters fi into a single combined glyph, where the top-right corner of the f merges with the dot above the i. In most languages, like English, this looks pretty and doesn't make the text any less readable.

Image showing the combined "fi" glyph

However, such a ligature is problematic in Turkish or other languages where the dotted and dotless I both exist and are distinct characters, because it makes it impossible to tell whether it represents fi (an f followed by a dotted i) or fı (an f followed by a dotless ı).

For that reason, fonts that include a substitution of fi with such a ligature will hopefully have that substitution only occur in languages for which it's appropriate. As I understand it, in OpenType, such rules are implemented by making "features" in the font specific to particular "language systems" via the Language System Table.

To see this in action, I downloaded a font with such a fi ligature - specifically Okta Neue - and created the following demo page:

<meta charset="utf-8">
<style>
    @font-face {
        font-family: oktaneue;
        src: url("Groteskly Yours - Okta Neue UltraLight.otf");
    }
    * {
        font-family: oktaneue;
    }
</style>
<span lang="en">Lütfiye</span>
<br>
<span lang="tr">Lütfiye</span>

Note that this time - unlike in the earlier example with hanzi and Kanji - both spans are using the same font. But, because the font itself contains language-specific features, the spans nonetheless render differently:

Screenshot of the aforementioned example page

As you can see, the fi ligature gets used for the span labelled as English, but not for the one labelled as Turkish - which is what we wanted!
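If a font does not carry such language-specific rules itself, one possible workaround is to switch the ligature off for Turkish text in CSS. This is only a sketch of that idea, not something the demo above relies on. It assumes the font exposes its fi ligature as a standard ("common") ligature, which is what font-variant-ligatures: no-common-ligatures disables.

```html
<meta charset="utf-8">
<style>
    /* Assumption: the font implements fi as a "common" ligature (OpenType liga feature) */
    :lang(tr) {
        font-variant-ligatures: no-common-ligatures;
    }
</style>
<!-- English text keeps the font's default ligature behaviour -->
<span lang="en">Lütfiye</span>
<br>
<!-- Turkish text has common ligatures turned off by the rule above -->
<span lang="tr">Lütfiye</span>
```

Either way, the approach depends on the text being labelled with the correct lang in the first place.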

Mark Amery answered Oct 17 '22 15:10