How to build an HTML parser?

Before you start linking to "RegEx match open tags except XHTML self-contained tags", read the whole question.

I'd like to write an HTML parser (only for HTML 5; it should check whether the input is HTML 5 and return an error if not) just to teach myself something new, but I don't know the best way to do that. Let me show you an example:

<!doctype html>
<html>
<head>
    <!-- #TITLE -->
    <title>Just an example</title>
</head>
<body>
    <p class='main'>Simple paragraph with an <a href='/a.html'>anchor</a></p>
</body>
</html>

Now, could anyone show me how to parse this (the final form doesn't matter, just the concept)? I had some ideas (like using recursive functions, or keeping a reference into an array that holds the current tag), but I don't think these were the best concepts. Should I check char by char and then call specific functions, or use regular expressions (explained below)?

By using regular expressions I don't mean one pattern for the whole tag. Rather, I mean using one pattern for the tag name (and if that one matches, checking the next patterns), then one for each attribute (and if that one matches, checking again), and lastly one for the end of the tag.
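A minimal sketch of that piecewise idea, for illustration only: one pattern for the tag name, one applied repeatedly for attributes, and one for the end of the tag. The helper name `readStartTag` and the shape of its return value are assumptions made for this example, and unquoted attribute values are deliberately not supported.

```javascript
// Hypothetical helper: reads one start tag from the beginning of `input`
// using separate patterns for the tag name, each attribute, and the tag end.
function readStartTag(input) {
    var namePattern = /^<([a-zA-Z][a-zA-Z0-9]*)/;
    var attrPattern = /^\s+([^\s=>\/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'))?/;
    var endPattern = /^\s*(\/?)>/;

    var match = namePattern.exec(input);
    if (!match) {
        return null; // not a start tag
    }
    var tag = { name: match[1].toLowerCase(), attributes: {}, selfClosing: false };
    var rest = input.substring(match[0].length);

    // Keep matching attributes until the pattern stops applying.
    while ((match = attrPattern.exec(rest))) {
        var value = match[2] !== undefined ? match[2]
                  : match[3] !== undefined ? match[3]
                  : ""; // attribute without a value, e.g. "disabled"
        tag.attributes[match[1]] = value;
        rest = rest.substring(match[0].length);
    }

    match = endPattern.exec(rest);
    if (!match) {
        return null; // malformed tag: no closing ">"
    }
    tag.selfClosing = match[1] === "/";
    return { tag: tag, rest: rest.substring(match[0].length) };
}

var result = readStartTag("<p class='main'>Simple paragraph");
console.log(result.tag.name, result.tag.attributes["class"], JSON.stringify(result.rest));
// p main "Simple paragraph"
```

Returning the unconsumed remainder in `rest` lets the caller continue from where the tag ended, whichever overall structure ends up driving the parse.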

What should I do when I find a tag? Run a loop which checks for tags (and if it finds a tag, call it again and again...)? But to me that seems like a recursive function, or at least half-recursive, where function X calls Y, which calls X...

So the final question is: what is the most efficient and correct structure for that?
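For comparison, the "array which holds the current tag" idea can be sketched as a single loop with an explicit stack of open elements, so no function has to call itself. This is only a sketch under strong assumptions: the function name `buildTree` and the node shape are made up for the example, declarations and comments are skipped crudely, attributes are ignored, and void elements, entities, and attribute values containing ">" are not handled.

```javascript
// Sketch: build a tree with one loop and an explicit stack of open elements.
function buildTree(markup) {
    var root = { name: "#document", children: [] };
    var stack = [root]; // the last entry is always the current parent
    // Alternatives: declaration/comment | closing tag | opening tag | text.
    // The sticky flag (y) forces each match to start exactly at lastIndex.
    var tokenPattern = /<![^>]*>|<\/([a-zA-Z][a-zA-Z0-9]*)\s*>|<([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)/y;
    var pos = 0;

    while (pos < markup.length) {
        tokenPattern.lastIndex = pos;
        var match = tokenPattern.exec(markup);
        if (!match) {
            throw new Error("Unexpected character at offset " + pos);
        }
        pos = tokenPattern.lastIndex;

        var parent = stack[stack.length - 1];
        if (match[1] !== undefined) {
            // Closing tag: it must match the innermost open element.
            if (stack.length === 1 || parent.name !== match[1].toLowerCase()) {
                throw new Error("Unexpected closing tag </" + match[1] + ">");
            }
            stack.pop();
        } else if (match[2] !== undefined) {
            // Opening tag: attach to the parent, then make it the new parent.
            var element = { name: match[2].toLowerCase(), children: [] };
            parent.children.push(element);
            stack.push(element);
        } else if (match[3] !== undefined && match[3].trim() !== "") {
            // Non-blank text: attach to the current parent.
            parent.children.push({ name: "#text", text: match[3] });
        }
        // Declarations, comments and whitespace-only text are skipped.
    }
    if (stack.length !== 1) {
        throw new Error("Unclosed element <" + stack[stack.length - 1].name + ">");
    }
    return root;
}

var tree = buildTree("<!doctype html>\n<html><body><p class='main'>Hi</p></body></html>");
console.log(tree.children[0].name); // html
```

The stack plays the role that the call stack plays in a recursive parser: pushing corresponds to descending into an element, popping to returning from it.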

533

asked Aug 01 '13 16:08

user1951214

1 Answer

@Kian's answer mentions using a lexer, but in terms of algorithms I think you'll want to use recursion. HTML is after all a recursive structure:

<div>
    <div>
        <div>
        </div>
    </div>
</div>

Here is a naive JS example, although it's not a complete implementation. (I've included no support for <empty /> elements; for <!-- comments -->; for &entities;; for xmlns:namespaces... Writing a full-fledged HTML or XML parser is a huge undertaking, so don't take it lightly.)

This solution notably skips over the process of lexical analysis, but I've deliberately omitted that to contrast my answer with @Kian's.

var markup = "<!DOCTYPE html>\n"+
             "<html>\n"+
             " <head>\n"+
             "   <title>Example Input Markup</title>\n"+
             " </head>\n"+
             " <body>\n"+
             "   <p id=\"msg\">\n"+
             "     Hello World!\n"+
             "   </p>\n"+
             " </body>\n"+
             "</html>";

parseHtmlDocument(markup);

// Function definitions

function parseHtmlDocument(markup) {
    console.log("BEGIN DOCUMENT");
    markup = parseDoctypeDeclaration(markup);
    markup = parseElement(markup);
    console.log("END DOCUMENT");
}

function parseDoctypeDeclaration(markup) {
    var regEx = /^(\<!DOCTYPE .*\>\s*)/i;
    console.log("DOCTYPE DECLARATION");
    var matches = regEx.exec(markup);
    var doctypeDeclaration = matches[1];
    markup = markup.substring(doctypeDeclaration.length);
    return markup;
}

function parseElement(markup) {
    var regEx = /^\<(\w*)/i;
    var matches = regEx.exec(markup);
    var tagName = matches[1];
    console.log("BEGIN ELEMENT: "+tagName);
    markup = markup.substring(matches[0].length);
    markup = parseAttributeList(markup);
    regEx = /^\>/i;
    matches = regEx.exec(markup);
    markup = markup.substring(matches[0].length);
    markup = parseNodeList(markup);
    regEx = new RegExp("^\<\/"+tagName+"\>");
    matches = regEx.exec(markup);
    markup = markup.substring(matches[0].length);
    console.log("END ELEMENT: "+tagName);
    return markup;
}

function parseAttributeList(markup) {
    var regEx = /^\s+(\w+)\=\"([^\"]*)\"/i;
    var matches;
    while(matches = regEx.exec(markup)) {
        var attrName = matches[1];
        var attrValue = matches[2];
        console.log("ATTRIBUTE: "+attrName);
        markup = markup.substring(matches[0].length);
    }
    return markup;
}

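function parseNodeList(markup) {
    // Consume child nodes until the parent's closing tag (or the end of input).
    while(markup) {
        markup = parseTextNode(markup);
        var regEx = /^\<(.)/i;
        var matches = regEx.exec(markup);
        if(!matches || matches[1] === '/') {
            // Closing tag of the parent, or nothing left: let the caller handle it.
            return markup;
        }
        markup = parseElement(markup);
    }
    return markup;
}

function parseTextNode(markup) {
    // Skip the text up to (but not including) the next "<".
    var regEx = /^([^\<]*)/;
    var matches = regEx.exec(markup);
    markup = markup.substring(matches[1].length);
    return markup;
}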

Ideally each of these functions would map very closely onto the grammar defined in the XML specification. For example, the specification defines an element like so:

element    ::=    EmptyElemTag | STag content ETag

... so ideally we'd want the parseElement() function to look more like this:

function parseElement(markup) {
    if(nextTokenIsEmptyElemTag(markup)) { // this kind of logic is where a lexer will help!
        markup = parseEmptyElemTag(markup);
    }
    else {
        markup = parseSTag(markup);
        markup = parseContent(markup);
        markup = parseETag(markup);
    }
    return markup; // whatever is left after this element
}

... but I've cut some corners in writing my example, so it doesn't reflect the actual grammar as closely as it should.
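As a rough illustration of how a lexer helps with that kind of check, here is a minimal sketch that turns markup into a flat token array first. The token type names follow the XML grammar (STag, ETag, EmptyElemTag, CharData); the function names, the token shape, and the decision to work on a token array instead of the raw string are assumptions made for this example. Only double-quoted attributes are recognised.

```javascript
// Sketch of a lexer: turns markup into a flat array of tokens.
function tokenize(markup) {
    // Alternatives: closing tag | opening or empty-element tag | character data.
    // The sticky flag (y) forces each match to start exactly at lastIndex.
    var pattern = /<\/(\w+)\s*>|<(\w+)(?:\s+\w+="[^"]*")*\s*(\/?)>|([^<]+)/y;
    var tokens = [];
    var pos = 0;
    while (pos < markup.length) {
        pattern.lastIndex = pos;
        var m = pattern.exec(markup);
        if (!m) {
            throw new Error("Cannot tokenize input at offset " + pos);
        }
        pos = pattern.lastIndex;
        if (m[1] !== undefined) {
            tokens.push({ type: "ETag", name: m[1] });
        } else if (m[2] !== undefined) {
            tokens.push({ type: m[3] === "/" ? "EmptyElemTag" : "STag", name: m[2] });
        } else {
            tokens.push({ type: "CharData", text: m[4] });
        }
    }
    return tokens;
}

// With tokens in hand, the check used by parseElement() becomes a simple lookup.
function nextTokenIsEmptyElemTag(tokens, index) {
    return index < tokens.length && tokens[index].type === "EmptyElemTag";
}

var tokens = tokenize('<p id="msg">Hello<br/></p>');
console.log(tokens.map(function (t) { return t.type; }).join(" "));
// STag CharData EmptyElemTag ETag
```

A parser built on top of this would pass a token index around instead of a shrinking string, and each grammar rule would only need to inspect `tokens[index].type` to decide which branch to take.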

188

answered Nov 11 '22 23:11

Richard JP Le Guen

Tags: html, parsing, structure
