Is HTML a context-free language?

Tags: html, grammar, language-theory, sgml

Reading some related questions made me think about the theoretical nature of HTML.

I'm not talking about XHTML-like code here. I'm talking about stuff like this crazy piece of markup, which is perfectly valid HTML(!)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"> <html<head> <title// <p ltr<span id=p></span</p> </>

So given the enormous complexity that SGML injects here, is HTML a context-free language? Is it a formal language anyway? With a grammar?

What about HTML5?

I'm new to the concept of formal languages, so please bear with me. And yes, I have read the Wikipedia article ;)

211

asked Mar 03 '11 02:03

user123444555621

1 Answer

Context Free is a concept from language theory that has important implications in parser implementation. A Context Free Language can be described by a Context Free Grammar, which is one in which all rules have a single non-terminal symbol at the left of the arrow:

X→δ

That simple restriction allows X to be substituted by the right-hand side of any rule in which it appears on the left, without regard to what came before or after. For example, if while deriving or parsing one arrives at:

αXλ

one is sure that

αδλ

is also valid. Examples of non-context-free rules would be:

XY→δ
Xa→δ
aX→δ

Those would require knowing what can be derived around X to determine whether a rule applies, and that leads to non-determinism (whatever is around X would in turn need to know what X derives to), which is a no-no in parsing; in any case, we want a language to be well-defined.
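The substitution property can be sketched in a few lines of Python. The grammar S → aSb | ε used here is a toy chosen purely for illustration (it is not part of HTML or SGML), and the function names are hypothetical. The point is that the rewriting step only ever looks at the non-terminal itself, never at its neighbours:

```python
# Toy context-free grammar (an assumption for illustration): S -> a S b | ""
RULES = {"S": ["aSb", ""]}


def rewrite_leftmost(form: str, choice: int) -> str:
    """Replace the leftmost non-terminal in `form` with the chosen right-hand side."""
    for i, symbol in enumerate(form):
        if symbol in RULES:
            # Only the symbol itself is inspected; its neighbours are never consulted.
            return form[:i] + RULES[symbol][choice] + form[i + 1:]
    return form  # no non-terminal left: the form is a finished sentence


def derive(choices: list[int], start: str = "S") -> list[str]:
    """Apply a sequence of rule choices and return every sentential form."""
    forms = [start]
    for choice in choices:
        forms.append(rewrite_leftmost(forms[-1], choice))
    return forms


if __name__ == "__main__":
    print(derive([0, 0, 1]))
```

A context-sensitive rule such as aX→δ could not be implemented this way, because the rewriting step would have to inspect the symbol to the left of X before deciding whether the rule applies.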

The usual way to prove that a language is context-free is to exhibit a context-free grammar (or an equivalent pushdown automaton) for it, which is not an easy task. Most programming languages one comes across are already described by CFGs, so the job is done. But there are other languages, including programming languages, that are described using logic or plain English, so work is required to find out whether they are context-free.

For HTML, the answer about its context-freedom is yes. SGML is a well-defined Context Free Language, and HTML defined on top of it is also a CFL. Parsers and grammars for both languages abound on the Web. At any rate, the existence of LL(k) grammars for valid HTML is enough proof that the language is context-free, because every LL(k) grammar is a context-free grammar.
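To give a feel for what an LL-style parser over strictly nested markup looks like, here is a minimal recursive-descent sketch. It is not an HTML parser: the three-tag set, the tokenizer, and the tuple-based tree are all assumptions made for illustration. Because the tag set is finite, each tag can be given its own production (element → `<t>` content `</t>`), which is what keeps the toy grammar context-free, and one token of lookahead is enough to choose the next step.

```python
import re

TAGS = ("p", "b", "i")  # hypothetical finite tag set, one production per tag
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")


def tokenize(source: str) -> list[tuple[str, str]]:
    """Split the input into ("open", name), ("close", name) and ("text", data) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        closing, name, text = match.groups()
        if text is not None:
            tokens.append(("text", text))
        else:
            tokens.append(("close" if closing else "open", name))
        pos = match.end()
    return tokens


def parse(source: str) -> list:
    """Return a list of nodes; an element is (name, children), text is a plain str."""
    tokens = tokenize(source)
    pos = 0

    def content() -> list:
        # content -> (text | element)* ; one token of lookahead decides each step
        nonlocal pos
        nodes = []
        while pos < len(tokens) and tokens[pos][0] != "close":
            kind, value = tokens[pos]
            if kind == "text":
                nodes.append(value)
                pos += 1
            else:
                nodes.append(element())
        return nodes

    def element() -> tuple:
        # element -> <t> content </t>, for each t in TAGS
        nonlocal pos
        _, name = tokens[pos]
        if name not in TAGS:
            raise SyntaxError(f"unknown tag <{name}>")
        pos += 1
        children = content()
        if pos >= len(tokens) or tokens[pos] != ("close", name):
            raise SyntaxError(f"expected </{name}>")
        pos += 1
        return (name, children)

    tree = content()
    if pos != len(tokens):
        raise SyntaxError("unexpected closing tag")
    return tree
```

Under these assumptions, `parse("<p>hi <b>there</b></p>")` yields a nested tree, while crossed tags such as `<p><b>x</p></b>` are rejected with a `SyntaxError` instead of being repaired.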

But the way HTML evolved over the life of the Web forced browsers to treat it as not that well defined. Modern Web browsers will go out of their way to try to render something sensible out of almost anything they find. The grammars they use are not CFGs, and the parsers are far more complex than the ones required for SGML/HTML.
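The tolerant attitude is easy to observe with Python's standard-library `html.parser.HTMLParser`. It is only an event-based tokenizer, not a browser engine and not a tree builder, so take this as a rough illustration; the `EventLogger` class and `log_events` helper are names invented for this sketch. Markup with unclosed and mismatched tags is simply reported as a stream of events rather than rejected:

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every start tag, end tag and text run the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(markup: str) -> list:
    """Feed `markup` to the lenient parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Two unclosed <p> elements and a stray </b>: no exception is raised.
    for event in log_events("<p>one<p>two</b>"):
        print(event)
```

Deciding what tree to build from such a stream (for example, that the second `<p>` implicitly closes the first) is where the real complexity of browser parsers lies.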

HTML is defined at several levels.

1. At the lexical level there are the rules for valid characters, identifiers, strings, and so on.
2. At the next level is the markup layer (SGML for HTML 4, XML for XHTML), which consists of the opening and closing <tags> that define a hierarchical document structure. You can use XML or something XML-like for any purpose, like Apache Ant does for build scripts.
3. At the next level are the tags that are valid in HTML, and the rules about which tags may be nested within which tags.
4. At the next level are the rules about which attributes are valid for which tags, and the languages that can be embedded in HTML, like CSS and JavaScript.
5. Finally, you have the semantic rules about what a given HTML document means.
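The nesting level in the list above can be sketched as a table-driven check that runs after the markup has already been parsed into a tree. Everything here is an assumption for illustration: the `ALLOWED_CHILDREN` table is a deliberately tiny, simplified stand-in for the real content models, and the tree shape (an element is a `(name, children)` tuple, text is a plain string) is invented for the sketch.

```python
# Deliberately simplified, hypothetical content model; real HTML DTDs are far richer.
ALLOWED_CHILDREN = {
    "ul": {"li"},
    "ol": {"li"},
    "li": {"ul", "ol", "p"},
    "p": set(),
}


def nesting_errors(node: tuple) -> list[str]:
    """Return one message per child element that its parent does not allow."""
    name, children = node
    errors = []
    for child in children:
        if isinstance(child, str):
            continue  # text is accepted everywhere in this toy model
        child_name = child[0]
        if child_name not in ALLOWED_CHILDREN.get(name, set()):
            errors.append(f"<{child_name}> is not allowed inside <{name}>")
        errors.extend(nesting_errors(child))
    return errors
```

Keeping this check separate from the tag-matching step mirrors the layering described above: the markup layer only establishes the hierarchy, and the HTML-specific rules are then applied to that hierarchy.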

The syntactic part is defined well enough that it can be verified. The semantic part is much larger than the syntactic one, and is defined in terms of browser actions regarding HTTP, the Document Object Model (DOM), and how that model should be rendered to the screen.

In the end:

1. Parsing correct HTML is extremely easy (it's context-free and LL/LR).
2. Parsing the HTML that actually exists over the Web is difficult.
3. Implementing the semantics (a browser) over HTML/CSS/DOM is extremely difficult.

117

answered Oct 07 '22 17:10

Apalala
