In HTML, it's good to have a lang attribute in <html>, e.g. <html lang="en">.
How is this useful?
If it is meant to help with translation, it doesn't seem to: even when the language is set to English, if all the text in the document is Chinese, Google Translate detects it as Chinese, not English (which means Google ignores the lang attribute).
The lang (or sometimes the xml:lang) attribute specifies the natural language of the content of a web page. An attribute on the html tag sets the language for all the text on the page.
Always add a lang attribute to the html tag to set the default language of your page. If this is XHTML 1.x or an HTML5 polyglot document served as XML, you should also use the xml:lang attribute (with the same value). If your page is only served as XML, just use the xml:lang attribute.
Whether or not you use the HTTP header, you should always declare the language of the text in a page using a language attribute on the html tag.
You should always include the lang attribute inside the <html> tag, to declare the language of the Web page. This is meant to assist search engines and browsers. Country codes can also be added to the language code in the lang attribute.
I am quoting this from W3C:
Declaring language in HTML
Always use a language attribute on the html tag to declare the default language of the text in the page. When the page contains content in another language, add a language attribute to an element surrounding that content.
Use the lang attribute for pages served as HTML, and the xml:lang attribute for pages served as XML. For XHTML 1.x and HTML5 polyglot documents, use both together.
Use language tags from the IANA Language Subtag Registry.
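To make the W3C advice concrete, here is a minimal sketch of a page whose default language is English, with individual passages marked as being in another language. The page text is made up for illustration; only the lang attributes matter.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Language declaration example</title>
</head>
<body>
  <!-- Inherits lang="en" from the html element -->
  <p>The French sentence below is marked up separately.</p>
  <!-- Overrides the page default for this element only -->
  <p lang="fr">J'aime le pain</p>
  <!-- A region subtag may follow the language subtag -->
  <p lang="en-GB">What is your favourite colour?</p>
</body>
</html>
```

For an XHTML 1.x or polyglot document, the same value would be repeated in xml:lang on the html tag, e.g. <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">.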
Also a good read is Why use the language attribute?.
You asked "how is this useful".
"The <lang=> attribute can be used to declare the language of a Web page or a portion of a Web page. This is meant to assist search engine spiders, page formatting and screen reader technology"
Source: http://symbolcodes.tlt.psu.edu/web/tips/langtag.html (Wayback Machine link)
No mention of translation - but often a search engine spider will not want to parse through a document "in the wrong language" - its index file will grow (lots of new words), and the results will not be useful to the user (who cannot read the language, and who is using the wrong search terms).
The advent of smart translation technology (like Google's, referred to above) means that some search engines can see a page in one language, translate it, and figure out that someone searching for "cow" may be interested in this page that mentions "vache" and has lang="fr".
The lang attribute is needed by screen readers to let them pronounce words correctly, and also (perhaps surprisingly) sometimes needed to allow text to be rendered correctly by the browser.
lang needed for speech synthesis
Some blind or visually impaired people use speech-synthesizing screen readers that speak the words on the screen. Since two words from different languages that are spelt identically may be pronounced differently, such speech synthesis cannot be done without knowing the language of the text. For instance, the word "pain" in English is pronounced completely differently to the word "pain" in French, so a screen reader that doesn't know whether it's reading English or French won't know how to pronounce "pain".
Using the lang attribute indicates to a screen reader what language some text is in and thus allows it to pronounce the word correctly.
I recorded a demonstration of this using Narrator, the built-in screen reader for Windows. (If you'd like to reproduce this, do note that you'll need to have both the English and French voice packages installed via the Speech settings page in the Windows Settings app, and have English as your default voice.) The demo uses an HTML page with the following content:
<h5>No lang specified:</h5>
<p>J'aime le pain</p>
<h5>French:</h5>
<p lang="fr">J'aime le pain</p>
As you can hear in the recording I uploaded at https://www.youtube.com/watch?v=7J1I65sn1CQ, Microsoft George (the default English voice) butchers the pronunciation of the French phrase (pronouncing it "Jay aim le payne"), whereas Microsoft Hortense (the default French voice) pronounces it correctly.
lang needed for text rendering
Perhaps surprisingly, the benefits of the lang attribute are not limited to disabled people using speech-synthesizing assistive tech. Setting lang can also affect text rendering, since the correct way to render some text can be language-dependent.
There are a couple of different mechanisms by which the lang you set can affect how text gets rendered:
different fonts being selected based on the lang attribute, either automatically by the browser (as in the Han character example below) or due to :lang selectors in your CSS
or
fonts having language-specific rules included in them, such as language-specific alternative glyphs or language-specific rules about which sequences of characters to substitute with a ligature
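As a rough illustration of the :lang selector mechanism, the sketch below picks a font per language in CSS. The font names are assumptions chosen for illustration (they happen to be common Windows fonts) and need to be installed for the rules to have any visible effect.

```html
<style>
  /* Font choices are illustrative assumptions, not a recommendation */
  :lang(zh) { font-family: "Microsoft YaHei", sans-serif; }
  :lang(ja) { font-family: "Yu Gothic", sans-serif; }
</style>
<!-- Same character in both paragraphs; only the declared language differs -->
<p lang="zh">飴</p>
<p lang="ja">飴</p>
```

Because :lang() matches on the inherited language, setting lang on an ancestor (including the html tag) is enough for these rules to apply to everything inside it.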
Below are a couple of interesting examples I found of such language-specific rendering.
There exist many Han (Chinese) characters that have been adopted in other east-Asian languages, such as Japanese (where such characters are called "Kanji"). The proper way to draw these characters sometimes differs between Chinese and the other languages that have assimilated them, yet, due to Unicode's Han unification, there only exists a single Unicode code point to represent the character, rather than a distinct code point for each language-specific variant of it. Several examples are listed in the Examples of language-dependent glyphs section of the Wikipedia article linked above.
When rendering such a character, in order to know which glyph to display (for instance, whether to display the Japanese Kanji or the Chinese hanzi), the browser needs to know the language of the text in which the character appears.
To try to see your browser considering text's language in this way, save the following HTML to a file and open it in your browser:
Chinese: <span lang="zh">飴</span>
<br>
Japanese: <span lang="ja">飴</span>
Note that the same character, 飴, is used in both spans. But they display differently in the browser, at least in Chrome on my Windows PC:
As you can see, the Kanji rendered in the span marked as Japanese is different in several ways from the hanzi rendered in the span marked as Chinese. By inspecting each span in the Chrome dev tools and looking at the "Rendered Fonts" section, I can see that this is because Chrome has used different fonts for the two spans - namely Microsoft YaHei for the Chinese span and Yu Gothic for the Japanese one.
fi ligatures getting disabled for Turkish text
As described at https://en.wikipedia.org/wiki/Ligature_(writing)#Stylistic_ligatures, a stylistic ligature is used in many fonts that merges together the letters fi into a single combined glyph, where the top-right corner of the f merges with the dot above the i. In most languages, like English, this looks pretty and doesn't make the text any less readable.
However, such a ligature is problematic in Turkish or other languages where the dotted and dotless I both exist and are distinct characters, because it makes it impossible to tell whether it represents fi (an f followed by a dotted i) or fı (an f followed by a dotless ı).
For that reason, fonts that include a substitution of fi with such a ligature will hopefully have that substitution only occur in languages for which it's appropriate. As I understand it, in OpenType, such rules are implemented by making "features" in the font specific to particular "language systems" via the Language System Table.
To see this in action, I downloaded a font with such a fi ligature - specifically Okta Neue - and created the following demo page:
<style>
  @font-face {
    font-family: oktaneue;
    src: url("Groteskly Yours - Okta Neue UltraLight.otf");
  }
  * {
    font-family: oktaneue;
  }
</style>
<span lang="en">Lütfiye</span>
<br>
<span lang="tr">Lütfiye</span>
Note that this time - unlike in the earlier example with hanzi and Kanji - both spans are using the same font. But, because the font itself contains language-specific features, the spans nonetheless render differently:
As you can see, the fi ligature gets used for the span labelled as English, but not for the one labelled as Turkish - which is what we wanted!
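If a font does not carry such language-specific rules itself, the same effect can be approximated from CSS. The following is only a sketch: it assumes the font exposes its fi ligature as a "common" ligature (the OpenType liga feature), which is typical but not guaranteed.

```html
<style>
  /* Assumption: the fi ligature is a common ligature in the font in use */
  :lang(tr) { font-variant-ligatures: no-common-ligatures; }
</style>
<span lang="en">Lütfiye</span>
<br>
<span lang="tr">Lütfiye</span>
```

Like the font's built-in behaviour, this rule only works if the Turkish text is actually marked with lang="tr".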