Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to build an HTML parser?

before you start linking to RegEx match open tags except XHTML self-contained tags read whole question.

I'd like to write an HTML parser (only for HTML 5, it should check if it is HTML 5 and if not, return an error) just to learn myself something new, but I don't know what is the best way to do that. Let me show you an example:

<!doctype html>
<html>
<head>
    <!-- #TITLE -->
    <title>Just an example</title>
</head>
<body>
    <p class='main'>Simple paragraph with an <a href='/a.html'>anchor</a></p>
</body>
</html>

Now, could anyone show me how to parse this (final form doesn't matter, just a concept)? I had some ideas (like using recursive functions, or making references to array which holds actual tag), but I don't think these were the best concepts. Should I check char by char and then call specific functions or use regular expressions (explained below)?

By using regular expressions I don't mean one pattern for whole tag. I rather mean using one pattern for tagname (and if this one returns true, check next patterns), then for attribute (and if this one returns true, check again), and lastly check for end of tag.

What should I do when I find tag? Run a loop which checks for tags (and if it finds tag, call it again and again...)? But for me it seems like recursive function or at least half-recursive when function X calls Y which calls X...

So the final question is: what is the most efficient and correct structure for that?

like image 533
user1951214 Avatar asked Aug 01 '13 16:08

user1951214


People also ask

How do you parse HTML?

HTML parsing involves tokenization and tree construction. HTML tokens include start and end tags, as well as attribute names and values. If the document is well-formed, parsing it is straightforward and faster. The parser parses tokenized input into the document, building up the document tree.

Can I use an XML parser to parse HTML?

You can try parsing an HTML file using a XML parser, but it's likely to fail. The reason is that HTML documents can have the following HTML features that XML parsers don't understand. XML parsers will fail to parse any HTML document that uses any of those features.

Why HTML parser is used?

The HTML parser is a structured markup processing tool. It defines a class called HTMLParser, ​which is used to parse HTML files. It comes in handy for web crawling​.


1 Answers

@Kian's answer mentions using a lexer, but in terms of algorithms I think you'll want to use recursion. HTML is after all a recursive structure:

<div>
    <div>
        <div>
        </div>
    </div>
</div>

Here is a naive JS example - although it's not a complete implementation. (I've included no support for <empty /> elements; for <!-- comments -->; for &entities;; for xmlns:namespaces... writing a full fledged HTML or XML parser is a huge undertaking, so don't take it lightly)

This solution notably skips over the process of lexical analysis, but I've deliberately omitted that to contrast my answer with @Kian's.

var markup = "<!DOCTYPE html>\n"+
             "<html>\n"+
             " <head>\n"+
             "   <title>Example Input Markup</title>\n"+
             " </head>\n"+
             " <body>\n"+
             "   <p id=\"msg\">\n"+
             "     Hello World!\n"+
             "   </p>\n"+
             " </body>\n"+
             "</html>";

parseHtmlDocument(markup);

// Function definitions

function parseHtmlDocument(markup) {
    console.log("BEGIN DOCUMENT");
    markup = parseDoctypeDeclaration(markup);
    markup = parseElement(markup);
    console.log("END DOCUMENT");
}

function parseDoctypeDeclaration(markup) {
    var regEx = /^(\<!DOCTYPE .*\>\s*)/i;
    console.log("DOCTYPE DECLARATION");
    var matches = regEx.exec(markup);
    var doctypeDeclaration = matches[1];
    markup = markup.substring(doctypeDeclaration.length);
    return markup;
}

function parseElement(markup) {
    var regEx = /^\<(\w*)/i;
    var matches = regEx.exec(markup);
    var tagName = matches[1];
    console.log("BEGIN ELEMENT: "+tagName);
    markup = markup.substring(matches[0].length);
    markup = parseAttributeList(markup);
    regEx = /^\>/i;
    matches = regEx.exec(markup);
    markup = markup.substring(matches[0].length);
    markup = parseNodeList(markup);
    regEx = new RegExp("^\<\/"+tagName+"\>");
    matches = regEx.exec(markup);
    markup = markup.substring(matches[0].length);
    console.log("END ELEMENT: "+tagName);
    return markup;
}

function parseAttributeList(markup) {
    var regEx = /^\s+(\w+)\=\"([^\"]*)\"/i;
    var matches;
    while(matches = regEx.exec(markup)) {
        var attrName = matches[1];
        var attrValue = matches[2];
        console.log("ATTRIBUTE: "+attrName);
        markup = markup.substring(matches[0].length);
    }
    return markup;
}

function parseNodeList(markup) {
    while(markup) {
        markup = parseTextNode(markup);
        var regEx = /^\<(.)/i;
        var matches = regEx.exec(markup);
        if(matches[1] !== '/') {

            markup = parseElement(markup);
        }
        else {
            return markup;
        }
    }
}

function parseTextNode(markup) {
    var regEx = /([^\<]*)\</i;
    var matches = regEx.exec(markup);
    markup = markup.substring(matches[1].length);
    return markup;
}

Ideally each of these functions would map very closely onto the grammar defined in the XML specification. For example, the specification defines an element like so:

element    ::=    EmptyElemTag | STag content ETag

... so ideally we'd want the parseElement() function to look more like this:

function parseElement(markup) {
    if(nextTokenIsEmptyElemTag) { // this kind of logic is where a lexer will help!
        parseEmptyElemTag(markup);
    }
    else {
        parseSTag(markup);
        parseContent(markup);
        parseETag(markup);
    }
}

... but I've cut some corners in writing my example, so it doesn't reflect the actual grammar as closely as it should.

like image 188
Richard JP Le Guen Avatar answered Nov 11 '22 23:11

Richard JP Le Guen