Semantics, standards and using the "lang" attribute for source code in markup

I haven't been able to find authoritative explanations, microformats or guidelines for the following, so I'm throwing it open. If I've missed something, speak up!

Let's say you have an HTML page that includes an example of some programming source code inside a <pre> element:

<pre>
    # code...
</pre>

(Update: As Pekka points out below, <code> might be better than <pre>, but the following examples and discussion apply to both. And as Brian Campbell points out, the two elements should of course be used together for preformatted code.)

Now: How do you – in a semantically correct and spec compliant way – declare the programming language of the `<pre>` block's contents?

This would be useful information to include in the markup in a semantically consistent way.

The obvious choice, from a semantic standpoint, would be to use the lang attribute:

<pre lang="ruby">

But according to the HTML 4 spec, section 8.1.1:

The lang attribute's value is a language code that identifies a natural language [...] Computer languages are explicitly excluded from language codes.

(emphasis mine)

And besides, "ruby" isn't a standard language code anyway.

The spec does allow for adding "experimental" or "private use" codes using the x primary tag. The example from the spec is lang="x-klingon".

In theory, you could use x-ruby, x-java and so forth to declare the programming language contained in the <pre> block – except that it seems the spec frowns upon using the lang attribute for programming languages in general.

The HTML 5 spec on the topic doesn't make matters any clearer. The spec itself doesn't explicitly mention "natural" vs "programming" languages. Instead it refers the reader to BCP 47, which states (again):

Language tags are used to help identify languages [...] but excludes languages not intended primarily for human communication, such as programming languages.

However, it goes on to mention (in section 4.1, page 56) the zxx primary language subtag, which:

identifies content for which a language classification is inappropriate or does not apply. Some examples might include instrumental or electronic music [...] or programming source code.

(emphasis mine)

Again, the spec seems to contradict itself, but it opens up the possibility of using zxx-x-ruby (or similar) as a fully spec-compliant way of both declaring something to be written in a language (just not a human one) and declaring the specific (non-human) language involved.

So, is there any semblance of a standard/microformat/microsyntax/gentleman's agreement/anything on what to do?

Personally, I like zxx-x-ruby as it's the most complete. x-ruby by itself is shorter and neater of course, but unless I'm mistaken, the <pre> block would still inherit the primary language of its parent (e.g. en or fr or similar).

Addendum:

As Pekka mentions below, the <code> tag would probably be more appropriate, and semantically it'd be very neat to simply say <code lang="...">. However, the <code> tag is also an inline element, and I was initially thinking only of longer runs of source code, i.e. declaring the language for all <code> elements contained in block-level <pre> elements.

Luckily, the lang attribute is global and can be applied to either element, so either one would work.

Second: I accidentally typed "zzx" everywhere instead of the correct "zxx"! It's one 'z', two 'x's. Apologies for the confusion.

asked Feb 27 '11 16:02

Flambino

1 Answer

To answer this question, we should look at two things: any potentially relevant specifications, and what is actually done in the real world. You've already mentioned what the relevant specifications say about the lang attribute: it is generally used for indicating the human language of the content, not the programming language. While BCP 47 mentions the zxx subtag for non-linguistic content, I don't believe it is really appropriate to use the lang attribute and the zxx subtag to specify the programming language. The reason is that most source code does actually have some linguistic content in a natural language: comments, variable names, strings, and the like. The lang attribute should probably be used to indicate the language of these, especially in cases like the use of CJK characters, where font selection might be based on the lang attribute. The programming language of a code example is really orthogonal to the human language contained within it; conflating the two will likely lead to confusion, not clarity.

So, let's check the specs for an alternative to the lang attribute. As Pekka points out in another answer, the <code> element is more semantically meaningful for marking up source code than the <pre> element, so let's check there. According to the HTML5 spec:

The code element represents a fragment of computer code. This could be an XML element name, a filename, a computer program, or any other string that a computer would recognize.

Although there is no formal way to indicate the language of computer code being marked up, authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, may do so by adding a class prefixed with "language-" to the element.

...
The following example shows how a block of code could be marked up using the pre and code elements.
<pre><code class="language-pascal">var i: Integer;
begin
   i := 1;
end.</code></pre>
A class is used in that example to indicate the language used.

Now, this isn't a formal specification, just an informal recommendation for how you could use a class to indicate the language represented. The example also shows how to use both a <pre> tag and <code> tag to mark up a block of code.
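To make the convention concrete, here is a minimal sketch of how a consumer might read the language back out of markup that follows the spec's suggestion. It uses only Python's standard html.parser module; the names CodeLanguageFinder and find_code_languages are hypothetical, chosen for illustration, and the sketch only looks at <code> elements.

```python
from html.parser import HTMLParser


class CodeLanguageFinder(HTMLParser):
    """Collect the language named by a "language-" class on each <code> element."""

    PREFIX = "language-"

    def __init__(self):
        super().__init__()
        self.languages = []  # one entry per <code> element; None if undeclared

    def handle_starttag(self, tag, attrs):
        if tag != "code":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        class_value = dict(attrs).get("class") or ""
        declared = None
        for name in class_value.split():
            if name.startswith(self.PREFIX) and len(name) > len(self.PREFIX):
                declared = name[len(self.PREFIX):]
                break
        self.languages.append(declared)


def find_code_languages(html):
    """Return the declared language (or None) for every <code> element in html."""
    finder = CodeLanguageFinder()
    finder.feed(html)
    finder.close()
    return finder.languages
```

Because the language is just one class among possibly several, other classes on the same element (for styling, say) do not interfere with it.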

We can look elsewhere for any sort of standards, but I haven't found any; there are no microformats for code formatting, and I haven't found any other specs that mention it. So, we move on to what people actually do. The best way to discover this is to look at what HTML syntax highlighting libraries do, since they are the main producers and consumers of code embedded in web pages in which the language actually matters.

There are two main types of HTML syntax highlighters; those that run on the server or offline, in Ruby or Python or PHP, and produce static HTML and CSS to be displayed by the browser, and those written in JavaScript, which find and highlight <pre> or <code> elements on the client side. The second category is more interesting, as they need to detect the language from the HTML provided to them; in the first category, you usually specify the language manually through the API or through some mechanism specific to your wiki, blog, or CMS syntax, and so there is no actual consumer of any language information that might be embedded in the HTML. We'll take a look at both categories for the sake of completeness.

For JavaScript syntax highlighters, I've found the following, with examples of their syntax for specifying a code block and its language:

SyntaxHighlighter: <pre class="brush: html">...</pre>. Appears to completely ignore how class should be used by introducing its own syntax for class attributes based on CSS syntax with the brush keyword used to indicate the language. Also has an option for using the <script> tag, to make it easier to copy and paste code in without having to escape <, using the same class syntax.
Highlight.js: <pre><code class="html">...</code></pre> or class="language-html" or the same on <pre>. This gives you several options, one of which corresponds to the recommendation in the HTML5 spec, the other simply uses the bare language name as the class name.
SHJS: <pre class="sh_html">...</pre>. Uses its own prefix for language names in the class, and only works on <pre>, not other elements.
beautyOfCode: <pre class="code"><code class="html">...</code></pre>. Based on SyntaxHighlighter, but with a somewhat less weird syntax. Requires the <pre> tag with class code and the <code> tag with a class indicating the language.
Chili: <code class="html">...</code>. Uses just the <code> tag, and uses the bare language as a class name.
Lighter.js: <pre class="html">...</pre>. Uses the bare language as a class name. You select the elements it will apply to using the API, but the example demonstrates it on <pre> tags.
DlHighlight: <pre name="code" class="html">...</pre>. Uses the bare language as a class name. You choose via the API what type of element to highlight (the example used pre) and the value of the name attribute to look for to indicate that you want syntax highlighting. I believe that this is an abuse of the name attribute.
google-code-prettify: <pre class="prettyprint lang-html">. Uses class names prefixed with lang- to specify the language, and the class prettyprint to indicate that you want syntax highlighting. The language class is optional; it will try to auto-detect the language if not specified.
JUSH: <code class="jush-html">...</code> or <code class="language-html">...</code>. Uses the code tag, with languages in a class prefixed by jush- or language-.
Rainbow: <pre><code data-language="javascript">...</code></pre> uses the custom attribute data-language, applied to either a <code> element, or a <pre> element, in order to support sites like Tumblr which strip out <code> elements.
Prism: <pre><code class="language-css">...</code></pre> follows the HTML5 spec for nested <pre> and <code>, and the recommendation for the class name.
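The variety in the list above can be summarised as a small normalisation routine. The following is only a sketch, not how any of these libraries actually work: the function guess_language and its parameters are hypothetical, and the prefixes are simply the ones seen in the examples above.

```python
import re

# Prefixes taken from the conventions listed above.
PREFIXES = ("language-", "lang-", "jush-", "sh_")

# SyntaxHighlighter's CSS-like form, e.g. class="brush: html".
BRUSH_PATTERN = re.compile(r"brush:\s*([\w+#-]+)")


def guess_language(class_value="", data_language=None, known_languages=()):
    """Return the declared language, or None if nothing recognisable is found."""
    if data_language:  # Rainbow-style data-language attribute
        return data_language.lower()

    brush = BRUSH_PATTERN.search(class_value)
    if brush:
        return brush.group(1).lower()

    names = class_value.split()
    for name in names:
        for prefix in PREFIXES:
            if name.startswith(prefix) and len(name) > len(prefix):
                return name[len(prefix):].lower()

    # Bare names ("html", "python") are ambiguous, so only accept listed ones.
    for name in names:
        if name.lower() in known_languages:
            return name.lower()
    return None
```

Note that the bare-name case is the only one that needs a list of known languages to avoid mistaking an ordinary class such as "code" for a language, which illustrates why a prefixed class is less ambiguous.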

For server-based and offline syntax highlighters, the majority (CodeRay, UltraViolet, Pygments, Highlight) do not embed any language information in the HTML they output at all. GeSHi is the only one I found that embeds the language, as <pre class="html">...</pre>, a <pre> tag with a bare language name as the class.

Out of that list, there seems to be no real consensus. The most popular option is just using the bare language name as a class. The next most popular is using some form of prefixed language name, either prefixed by the library name, lang-, or language-. There are a few that have their own strange conventions, or don't specify the language in the HTML at all.

While the only thing common enough to be a de-facto standard is using the bare language name as a class, I would recommend going with what the HTML5 spec recommends: a class name of language- followed by the name of the language. This is supported by a few of the syntax highlighters; the rest could probably be easily modified to support it. It is less ambiguous and less likely to conflict with other classes than just the bare language name as a class. And, even if not formally specified, it is at least mentioned in a spec.

I would also use the <code> tag to indicate source code, either bare or embedded in a <pre> tag; the combination of a <code> tag and language- prefixed class can be used to indicate that you have source code in a particular language, and could be used to indicate you want it to be highlighted, and is clearer and better matches the semantics of the elements than some of the other indicators used by syntax highlighting libraries. For cases in which a <code> tag can't be used, such as embedding in sites that accept only a limited HTML subset like Tumblr, just using the <pre> tag with the same class convention is probably best.

edit to add: The CommonMark specification, which attempts to standardize Markdown so that implementations can be interoperable, producing the same HTML given the same input, has also adopted this suggested convention. It adds fenced code blocks to Markdown, surrounded with ``` or ~~~, which can be easier to use than indentation based code blocks. Immediately following the opening fence can be an info string, which is defined as:

An info string can be provided after the opening code fence. Opening and closing spaces will be stripped, and the first word, prefixed with language-, is used as the value for the class attribute of the code element within the enclosing pre element.
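As a rough illustration of the rule quoted above, here is a sketch of turning an info string and a block of code into HTML. It is not a CommonMark implementation (it does no fence parsing, and its escaping is simplified); render_fenced_block is a hypothetical helper that only shows how the first word of the info string becomes the class.

```python
from html import escape


def render_fenced_block(info_string, code):
    """Render one fenced code block as <pre><code>, following the quoted rule."""
    # Strip surrounding spaces and keep only the first word of the info string.
    words = info_string.strip().split()
    class_attr = ""
    if words:
        class_attr = ' class="language-{}"'.format(escape(words[0], quote=True))
    # Escape <, > and & so the code is displayed rather than interpreted.
    return "<pre><code{}>{}</code></pre>".format(class_attr, escape(code, quote=False))
```

With an info string of "python", this produces a <code> element with class="language-python" inside a <pre> element; with an empty info string, no class attribute is emitted.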

It can also be instructive to check what actual implementations do. Trying out a fenced code block on Babelmark gives the following breakdown among the implementations that support fenced code blocks (not all do, as it's an extension to the original Markdown):

showdown, blackfriday, haskell markdown: <pre><code class="python">...</code></pre>
marked: <pre><code class="lang-python">...</code></pre>
commonmark, parsedown, cebe/markdown: <pre><code class="language-python">...</code></pre>
cheapskate, minima: <pre class="python">...</pre>
pandoc: <div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">...</code></pre></div> (quite the overkill)
Maruku: <pre class="python"><code class="python">...</code></pre>

Looking at other document markup languages that convert to HTML and have some understanding of code blocks:

AsciiDoc: <pre>...</pre>; simply uses Pygments to highlight and does not include language information in the HTML.
rst2html gave me <pre class="code python literal-block">...</pre>, highlighted with Pygments.
Sphinx: <div class="highlight-python"><div class="highlight"><pre>...</pre></div></div>, also highlighted with Pygments.

So, overall, fairly large diversity in choices by different projects, but there does seem to be some movement towards standardizing on <pre><code class="language-python">...</code></pre>.

answered Sep 24 '22 22:09

Brian Campbell
