Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

HtmlElement.Parent returns wrong parent

I'm trying to generate CSS selectors for random elements on a webpage by means of C#. Some background:

I use a form with a WebBrowser control. While navigating one can ask for the CSS selector of the element under the cursor. Getting the html-element is trivial, of course, by means of:

WebBrowser.Document.GetElementFromPoint(<Point>);

The ambition is to create a 'strict' css selector leading up to the element under the cursor, a-la:

html > body > span:eq(2) > li:eq(5) > div > div:eq(3) > span > a

This selector is based on :eq operators since it's meant to be handled by jQuery and/or SizzleJS (these two support :eq - original CSS selectors don't. Thumbs up @BoltClock for helping me clarify this). So, you get the picture. In order to achieve this goal, we supply the retrieved HtmlElement to the below method and start ascending up the DOM tree by asking for the Parent of each element we come across:

    private static List<String> GetStrictCssForHtmlElement(HtmlElement element)
    {
        List<String> familyTree;
        for (familyTree = new List<String>(); element != null; element = element.Parent)
        {
            string ordinalString = CalculateOrdinalPositionAmongSameTagSimblings(element);
            if (ordinalString == null) return null;

            familyTree.Add(element.TagName.ToLower() + ordinalString);
        }
        familyTree.Reverse();

        return familyTree;
    }

    private static string CalculateOrdinalPositionAmongSameTagSimblings(HtmlElement element, bool simplifyEq0 = true)
    {
        int count = 0;
        int positionAmongSameTagSimblings = -1;
        if (element.Parent != null)
        {
            foreach (HtmlElement child in element.Parent.Children)
            {
                if (element.TagName.ToLower() == child.TagName.ToLower())
                {
                    count++;
                    if (element == child)
                    {
                        positionAmongSameTagSimblings = count - 1;
                    }
                }
            }

            if (positionAmongSameTagSimblings == -1) return null; // Couldn't find child in parent's offsprings!?   
        }

        return ((count > 1) ? (":eq(" + positionAmongSameTagSimblings + ")") : ((simplifyEq0) ? ("") : (":eq(0)")));
    }

This method has worked reliably for a variety of pages. However, there's one particular page which makes my head in:

http://www.delicious.com/recent

Trying to retrieve the CSS selector of any element in the list (at the center of the page) fails for one very simple reason:

After the ascension hits the first SPAN element in it's way up (you can spot it by inspecting the page with IE9's web-dev tools for verification) it tries to process it by calculating it's ordinal position among it's same tag siblings. To do that we need to ask it's Parent node for the siblings. This is where things get weird. The SPAN element reports that it's Parent is a DIV element with id="recent-index". However that's not the immediate parent of the SPAN (the immediate parent is LI class="wrap isAdv"). This causes the method to fail because -unsurprisingly- it fails to spot SPAN among the children.

But it gets even weirder. I retrieved and isolated the HtmlElement of the SPAN itself. Then I got it's Parent and used it to re-descend back down to the SPAN element using:

HtmlElement regetSpanElement = spanElement.Parent.Children[0].Children[1].Children[1].Children[0].Children[2].Children[0];

This lead us back to the SPAN node we begun ... with one twist however:

regetSpanElement.Parent.TagName;

This now reports LI as the parent X-X. How can this be? Any insight?

Thank you again in advance.

Notes:

  1. I saved the Html code (as it's presented inside WebBrowser.Document.Html) and inspected it myself to be 100% sure that nothing funny is taking place (aka different code served to WebBrowser control than the one I see in IE9 - but that's not happening the structure matches 100% for the path concerned).

  2. I am running WebBrowser control in IE9-mode using the instructions outlined here:

    http://www.west-wind.com/weblog/posts/2011/May/21/Web-Browser-Control-Specifying-the-IE-Version

    Trying to get WebBrowser control and IE9 to run as similarly as possible.

  3. I suspect that the effects observed might be due to some script running behind my back. However my knowledge is not so far reaching in terms of web-programming to pin it down.

Edit: Typos

like image 652
XDS Avatar asked Jul 26 '11 15:07

XDS


1 Answers

Relying on :eq() is tough! It is difficult to reliably re-select out of a DOM that is dynamic. Sure it may work on very static pages, but things are only getting more dynamic every day. You might consider changing strategy a little bit. Try using a smarter more flexible selector. Perhaps pop in some javascript like so:

predictCss = function(s, noid, noclass, noarrow) {
    var path, node = s;
    var psep = noarrow ? ' ' : ' > ';
    if (s.length != 1) return path; //throw 'Requires one element.';
    while (node.length) {
        var realNode = node[0];
        var name = (realNode.localName || realNode.tagName || realNode.nodeName);
        if (!name || name == '#document') break;
        name = name.toLowerCase();
        if(node.parent().children(name).length > 1){
            if (realNode.id && !noid) {
                try {
                    var idtest = $(name + '#' + realNode.id);
                    if (idtest.length == 1) return name + '#' + realNode.id + (path ? '>' + path : '');
                } catch (ex) {} // just ignore the exception, it was a bad ID
            } else if (realNode.className && !noclass) {
                name += '.' + realNode.className.split(/\s+/).join('.');
            }
        }
        var parent = node.parent();
        if (name[name.length - 1] == '.') { 
            name = name.substring(0, name.length - 1);
        }
        siblings = parent.children(name); 
        //// If you really want to use eq:
        //if (siblings.length > 1) name += ':eq(' + siblings.index(node) + ')';
        path = name + (path ? psep + path : '');
        node = parent;
    }
    return path
}

And use it to generate a variety of selectors:

var elem = $('#someelement');
var epath = self.model.util.predictCss(elem, true, true, false);
var epathclass = self.model.util.predictCss(elem, true, false, false);
var epathclassid = self.model.util.predictCss(elem, false, false, false);

Then use each:

var relem= $(epathclassid);
if(relem.length === 0){
    relem = $(epathclass);
    if(relem.length === 0){
        relem = $(epath);
    }
}

And if your best selector still comes out with more than one element, you'll have to get creative in how you match a dom element - perhaps levenshtein or perhaps there is some specific text, or you can fallback to eq. Hope that helps!

Btw, I assumed you have jQuery - due to the sizzle reference. You could inject the above in a self-executing anonymous function in a script tag appended to the last child of body for example.

like image 179
Matt Mullens Avatar answered Sep 20 '22 17:09

Matt Mullens