
XPath/HtmlAgilityPack: How to find an element (a) with a specific value for an attribute (href) and find adjacent table columns?

I'm pretty desperate because I can't figure out how to achieve what I stated in the question. I've already read countless similar examples but didn't find one that works in my exact situation. So, let's say I have the following code:

<table><tr>
<td><a href="url-a">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>
<td><a href="url-b">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>
<td><a href="url-c">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>
</tr></table>

Now, what I already have is a part of url-a. I basically want to know how I can get id A and img A. I'm trying to "find" the line with XPath but I can't work out a way to make it work. Also, it might be possible that the information is not present at all. This is my latest try (seriously, I've tinkered with this for more than 3 hours now trying numerous different ways):

if (htmlDoc.DocumentNode.SelectSingleNode(@"/a[contains(@href, 'part-url-a')]") != null)
    string ida = htmlDoc.DocumentNode.SelectSingleNode(@"/a[contains(@href, 'part-url-a')]/following-sibling::a").InnerText;

Well, it's apparently wrong as hell, so I'd be very pleased if someone could help me out here. I'd also appreciate it if someone could point me to a website that explains XPath and its notation/syntax in detail, with examples like this one. Books are also welcome.

PS: I know I could achieve my goal without XPath, using Regex or just a simple StreamReader in C# and checking whether each line contains what I need, but a) that's too fragile for my needs because the code might have abrupt line breaks and b) I really want to stay consistent and stick completely to XPath for everything I'm doing in this project.

Thanks in advance for your help!

Gernony asked Sep 03 '11 19:09


2 Answers

Use the following XPath expression:

   /*/tr/td[a[@href='url-a']]
                /following-sibling::td[1]
                     /a/text()

When evaluated against the provided (malformed but corrected) XML document:

<table><tr>
<td><a href="url-a">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>
<td><a href="url-b">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>
<td><a href="url-c">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>
</tr></table>

the wanted text node is selected:

id A

Similarly, this XPath expression:

   /*/tr/td[a[@href='url-a']]
                /following-sibling::td[2]
                     /a/text()

when evaluated against the same XML document (above), selects the other wanted text node:

img A

XSLT-based verification:

When this transformation is applied on the XML document (above):

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/tr/td[a[@href='url-a']]
                /following-sibling::td[1]
                     /a/text()"/>

  <xsl:text>&#10;</xsl:text>
  <xsl:copy-of select=
   "/*/tr/td[a[@href='url-a']]
                /following-sibling::td[2]
                     /a/text()"/>
 </xsl:template>
</xsl:stylesheet>

the wanted results are produced:

id A
img A
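
For anyone who would rather check the two expressions from C# than from XSLT, here is a minimal sketch. It assumes the markup is well-formed XML (as in the corrected document above) and uses `XmlDocument` from System.Xml rather than Html Agility Pack. The names `XPathCheck` and `SelectText` are made up for illustration.

```csharp
using System;
using System.Xml;

static class XPathCheck
{
    // The corrected, well-formed markup from above.
    public const string Markup = @"<table><tr>
<td><a href=""url-a"">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>
<td><a href=""url-b"">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>
<td><a href=""url-c"">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>
</tr></table>";

    // Returns the link text in the n-th td after the td whose link has the
    // given href, or null when nothing matches.
    // Assumption: href contains no single quote, since it is spliced into the XPath.
    public static string SelectText(XmlDocument doc, string href, int n)
    {
        string xpath = "/*/tr/td[a[@href='" + href + "']]"
                     + "/following-sibling::td[" + n + "]/a/text()";
        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? null : node.Value;
    }

    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        Console.WriteLine(SelectText(doc, "url-a", 1)); // id A
        Console.WriteLine(SelectText(doc, "url-a", 2)); // img A
    }
}
```

Because `SelectText` returns null when there is no match, it also covers the case where the information is not present at all.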
Dimitre Novatchev answered Sep 28 '22 18:09


Your HTML is seriously broken, with unmatched closing td tags. Please fix them. This markup is just an ugly picture.

That said, Html Agility Pack can hopefully handle any crap that you throw at it, so here's how to parse the junk you have and find the id and img values given the href:

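using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("test.html");
        // Find the first link whose href contains the known part of the URL.
        var anchor = doc.DocumentNode.SelectSingleNode("//a[contains(@href, 'url-a')]");
        if (anchor != null)
        {
            // SelectSingleNode returns the first match, i.e. the link in the next td.
            var id = anchor.ParentNode.SelectSingleNode("following-sibling::td/a");
            if (id != null)
            {
                Console.WriteLine(id.InnerHtml);
                // Repeat from the id's td to reach the img link in the td after it.
                var img = id.ParentNode.SelectSingleNode("following-sibling::td/a");
                if (img != null)
                {
                    Console.WriteLine(img.InnerHtml);
                }
            }
        }
    }
}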
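
The two lookups can also be folded into the XPath itself by combining `contains()` with a positional `following-sibling::td[n]` step. The following is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced and that the markup is loaded from a string. `SiblingLookup` and `FindSiblingText` are hypothetical names, not part of the library.

```csharp
using System;
using HtmlAgilityPack;

static class SiblingLookup
{
    // Returns the link text in the n-th td after the td whose link href
    // contains hrefPart, or null when there is no such cell.
    // Assumption: hrefPart contains no single quote, since it is spliced into the XPath.
    public static string FindSiblingText(HtmlDocument doc, string hrefPart, int n)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//td[a[contains(@href, '" + hrefPart + "')]]"
            + "/following-sibling::td[" + n + "]/a");
        return node == null ? null : node.InnerText;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table><tr>" +
            "<td><a href=\"url-a\">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>" +
            "<td><a href=\"url-b\">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>" +
            "</tr></table>");

        Console.WriteLine(FindSiblingText(doc, "url-a", 1) ?? "(not found)");
        Console.WriteLine(FindSiblingText(doc, "url-a", 2) ?? "(not found)");
        Console.WriteLine(FindSiblingText(doc, "no-such-url", 1) ?? "(not found)");
    }
}
```

This does one null check per value instead of a nested one, and it does not depend on the id cell being found before the img cell can be looked up.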
Darin Dimitrov answered Sep 28 '22 20:09