Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Where's the bug in this tree traversal code?

There's a bug in Traverse() that's causing it to iterate nodes more than once.

Bugged Code

public IEnumerable<HtmlNode> Traverse()
{
    foreach (var node in _context)
    {
        yield return node;
        foreach (var child in Children().Traverse())
            yield return child;
    }
}

public SharpQuery Children()
{
    return new SharpQuery(_context.SelectMany(n => n.ChildNodes).Where(n => n.NodeType == HtmlNodeType.Element), this);
}

public SharpQuery(IEnumerable<HtmlNode> nodes, SharpQuery previous = null)
{
    if (nodes == null) throw new ArgumentNullException("nodes");
    _previous = previous;
    _context = new List<HtmlNode>(nodes);
}

Test Code

    static void Main(string[] args)
    {
        var sq = new SharpQuery(@"
<a>
    <b>
        <c/>
        <d/>
        <e/>
        <f>
            <g/>
            <h/>
            <i/>
        </f>
    </b>
</a>");
        var nodes = sq.Traverse();
        Console.WriteLine("{0} nodes: {1}", nodes.Count(), string.Join(",", nodes.Select(n => n.Name)));
        Console.ReadLine();

Output

19 nodes: #document,a,b,c,g,h,i,d,g,h,i,e,g,h,i,f,g,h,i

Expected Output

Each letter a-i printed once.

Can't seem to figure out where's it going wrong... node.ChildNodes does return just direct children, right? (from HtmlAgilityPack)


Full class (condensed) if you want to try and run it yourself.

public class SQLite
{
    private readonly List<HtmlNode> _context = new List<HtmlNode>();
    private readonly SQLite _previous = null;

    public SQLite(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        _context.Add(doc.DocumentNode);
    }

    public SQLite(IEnumerable<HtmlNode> nodes, SQLite previous = null)
    {
        if (nodes == null) throw new ArgumentNullException("nodes");
        _previous = previous;
        _context = new List<HtmlNode>(nodes);
    }

    public IEnumerable<HtmlNode> Traverse()
    {
        foreach (var node in _context)
        {
            yield return node;
            foreach (var child in Children().Traverse())
                yield return child;
        }
    }

    public SQLite Children()
    {
        return new SQLite(_context.SelectMany(n => n.ChildNodes).Where(n => n.NodeType == HtmlNodeType.Element), this);
    }
}
like image 307
mpen Avatar asked Dec 21 '22 21:12

mpen


2 Answers

First, note that everything goes according to plan as long as we're iterating over elements that don't have any sibling. As soon as we hit element <c>, things start going haywire. It's also interesting that <c>, <d> and <e> all think they contain <f>'s children.

Let's take a closer look at your call to SelectMany():

_context.SelectMany(n => n.ChildNodes)

That iterates through the items in _context and accumulates the child nodes of each item. Let's have a look at _context. Everything should be okay, since it's a List of length 1. Or is it?

I suspect your SharpQuery(string) constructor stores sibling elements inside the same list. In that case, _context may not be of length 1 anymore and, remember, SelectMany() accumulates the child nodes of each item in the list.

For example, if _context is a list containing <c>, <d>, <e> and <f>, only <f> has children, and SelectMany() is called once for each element, it will accumulate and return the children of <f> four times.

I think that's your bug.

EDIT: Since you've posted the full class, I don't have to guess anymore. Look at the sequence of operations when you iterate over <b> (iterators replaced by lists for better comprehension):

  1. Call Children() on <b>, which returns <c>, <d>, <e> and <f>,
  2. Call Traverse() on the resulting list:
    1. Call Children() on <c>, but _context actually contains <c>, <d>, <e> and <f>, not only <c>, so that returns <g>, <h> and <i>,
    2. Call Children() on <d>, same thing since _context is the same for both <c> and <d> (and <e>, and <f>),
    3. Lather, rinse, repeat.
like image 68
Frédéric Hamidi Avatar answered Dec 24 '22 11:12

Frédéric Hamidi


Children() is broken, it selects children of siblings as well. I'd rewrite to something like this:

public IEnumerable<HtmlNode> Traverse(IEnumerable<HtmlNode> nodes)
{
    foreach (var node in nodes)
    {
        yield return node;
        foreach (var child in Traverse(ChildNodes(node)))
            yield return child;
    }
}

private IEnumerable<HtmlNode> ChildNodes(HtmlNode node)
{
    return node.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element);
}


public SharpQuery(IEnumerable<HtmlNode> nodes, SharpQuery previous = null)
{
    if (nodes == null) throw new ArgumentNullException("nodes");
    _previous = previous; // Is this necessary?
    _context = new List<HtmlNode>(nodes);
}
like image 22
Robert Jeppesen Avatar answered Dec 24 '22 11:12

Robert Jeppesen