Parsing simple HTML into tree

I'd like to ask what the best way is to parse simple HTML into a DOM node tree like this one:

[image: example tree]

Here are some constraints I am facing:

The HTML will contain only paired tags with no attributes, and I have to ignore whitespace.
There can be text between tags such as <p>, <h1>, <a>, etc.
I can't use libraries.

I was thinking about using regex, but I have never tried it. Any ideas?

Every node in the tree is this struct:

  typedef struct tag
  {
      struct tag* parent;
      struct tag* nextSibling;
      struct tag* previousSibling;
      struct tag* firstChild;
      struct tag* lastChild;
      char* name;
      char* text;
  } node;

asked Mar 26 '13 21:03

user2213470

1 Answer

I know it isn't in C, but this presentation might give you some ideas on how to tackle the problem efficiently.

https://web.archive.org/web/20120115060003/http://cuddle.googlecode.com/hg/talk/lex.html#landing-slide

I also wrote a very simple parser example in JavaScript (again not in C, but hopefully you know JS as well) based on your initial requirements: it does not parse attributes, does not handle self-closing tags, and skips many other things that the HTML specification requires. It produces a parse tree in this format (each element node also holds a parentNode reference, omitted here):

{
    cn: [{
        tag: 'html',
        cn: [{
            tag: 'body',
            cn: [
                { tag: 'h1', cn: ['test'] },
                ' some text ',
                ...
            ]
        }] 
    }]
}
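
To make the format concrete, here is a minimal sketch of a walker over such a tree. It assumes only what is shown above: element nodes are objects with a `tag` string and a `cn` array, and text nodes are plain strings. The name `outline` and the hand-written `sampleTree` are made up for illustration and are not part of the parser.

```javascript
// Hypothetical helper: collects one indented line per node in the tree.
function outline(node, depth, lines) {
    depth = depth || 0;
    lines = lines || [];

    var indent = '  '.repeat(depth);

    if (typeof node === 'string') {
        // Text nodes are plain strings in this format.
        lines.push(indent + JSON.stringify(node));
        return lines;
    }

    // The root wrapper has no tag, so label it explicitly.
    lines.push(indent + (node.tag ? '<' + node.tag + '>' : '(root)'));

    for (var k = 0; k < node.cn.length; k++) {
        outline(node.cn[k], depth + 1, lines);
    }

    return lines;
}

// Sample tree written by hand in the format above (assumption: no parentNode).
var sampleTree = {
    cn: [{
        tag: 'html',
        cn: [{
            tag: 'body',
            cn: [
                { tag: 'h1', cn: ['test'] },
                ' some text '
            ]
        }]
    }]
};

console.log(outline(sampleTree).join('\n'));
```

Text nodes are printed through `JSON.stringify` so that leading and trailing spaces stay visible.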

Here's the code and fiddle: http://jsfiddle.net/LUpyZ/3/

Note that white space is not ignored and will be captured in text nodes.
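
Since the question asks for whitespace to be ignored, one option is a separate pass over the finished tree rather than a change to the parser. The sketch below is independent of the parser code and assumes the `{ tag, cn }` format described above. `stripWhitespace` is a hypothetical name, and trimming every text node (not only dropping the blank ones) is an assumption about what "ignore spaces" means.

```javascript
// Hypothetical post-processing pass: trims text nodes and drops the ones
// that contain only whitespace. It returns a new tree and leaves the input
// untouched. Parent links (parentNode) are not copied into the result.
function stripWhitespace(node) {
    var result = node.tag !== undefined ? { tag: node.tag, cn: [] } : { cn: [] };

    node.cn.forEach(function (child) {
        if (typeof child === 'string') {
            var trimmed = child.trim();
            if (trimmed) {
                result.cn.push(trimmed);
            }
        } else {
            result.cn.push(stripWhitespace(child));
        }
    });

    return result;
}

// Hand-written sample input in the parser's output format.
var tree = {
    cn: [{ tag: 'div', cn: [' ', { tag: 'p', cn: ['text'] }, ' some text '] }]
};

console.log(JSON.stringify(stripWhitespace(tree)));
```

The parser itself is listed next.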

var html = '<html><body><h1>test</h1> some text <div> <p>text</p></div></body></html>';

function parseHTML(html) {
    // All state is local to the call, so the parser can be reused on any input.
    var nodesStack = [],
        i = 0,
        len = html.length,
        stateFn = parseText,
        parseTree = { cn: [] },
        alphaNumRx = /\w/,
        currentNode = parseTree,
        text = '',
        tag = '',
        newNode;

    // Called on the character right after '<'.
    function parseTag(token) {
        if (token === '/') {
            return parseCloseTag;
        }

        i--; // backtrack to first tag character
        return parseOpenTag;
    }

    // Accumulates a closing tag name and checks it against the open node.
    function parseCloseTag(token) {
        if (token === '>') {
            if (currentNode.tag !== tag) {
                throw new Error('Wrong closed tag at char ' + i);
            }

            tag = '';

            nodesStack.pop();

            currentNode = currentNode.parentNode;

            return parseText;
        }

        assertValidTagNameChar(token);

        tag += token;

        return parseCloseTag;
    }

    // Accumulates an opening tag name and creates the new node on '>'.
    function parseOpenTag(token) {
        if (token === '>') {
            currentNode.cn.push(newNode = { tag: tag, parentNode: currentNode, cn: [] });
            nodesStack.push(currentNode = newNode);

            tag = '';

            return parseText;
        }

        assertValidTagNameChar(token);

        tag += token;

        return parseOpenTag;
    }

    // Accumulates text until the next '<'.
    function parseText(token) {
        if (token === '<') {
            flushText();

            return parseTag;
        }

        text += token;

        return parseText;
    }

    // Stores any pending text as a text node of the current element.
    function flushText() {
        if (text) {
            currentNode.cn.push(text);
            text = '';
        }
    }

    function assertValidTagNameChar(c) {
        if (!alphaNumRx.test(c)) {
            throw new Error('Invalid tag name char at ' + i);
        }
    }

    // Each state function consumes one character and returns the next state.
    for (; i < len; i++) {
        stateFn = stateFn(html[i]);
    }

    if (stateFn !== parseText) {
        throw new Error('Unexpected end of input inside a tag.');
    }

    if (currentNode = nodesStack.pop()) {
        throw new Error('Unbalanced tags: ' + currentNode.tag + ' is never closed.');
    }

    flushText(); // keep text that follows the last tag

    return parseTree;
}

console.log(parseHTML(html));
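
The question's `node` struct links children through `firstChild`/`lastChild` and `nextSibling`/`previousSibling` rather than through an array. As a rough sketch, still in JavaScript, a tree in the format produced above could be converted to that layout as follows. `toLinkedNode` is a hypothetical helper, and representing a text child as a node with a `null` name and a filled `text` field is an assumption, since the question does not say how text is meant to be stored.

```javascript
// Hypothetical helper: converts a { tag, cn } node (or a text string) into
// an object with the same fields as the question's C struct.
function toLinkedNode(source, parent) {
    var isText = typeof source === 'string';
    var node = {
        parent: parent || null,
        nextSibling: null,
        previousSibling: null,
        firstChild: null,
        lastChild: null,
        name: isText ? null : (source.tag || null),
        text: isText ? source : null // assumption: text lives in its own node
    };

    if (!isText) {
        source.cn.forEach(function (childSource) {
            var child = toLinkedNode(childSource, node);

            if (node.lastChild) {
                // Append after the current last child.
                child.previousSibling = node.lastChild;
                node.lastChild.nextSibling = child;
            } else {
                // This is the first child of the node.
                node.firstChild = child;
            }

            node.lastChild = child;
        });
    }

    return node;
}

// Hand-written sample input in the parser's output format.
var parsed = {
    cn: [{ tag: 'body', cn: [{ tag: 'h1', cn: ['test'] }, ' some text '] }]
};
var root = toLinkedNode(parsed);

// Walk the children of <body> through the sibling links.
for (var n = root.firstChild.firstChild; n; n = n.nextSibling) {
    console.log(n.name || JSON.stringify(n.text));
}
```

In C, the same appending logic would apply to heap-allocated structs, with the strings copied into `name` and `text`.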

answered Sep 27 '22 18:09

plalx

Tags: c, parsing, html-parsing
