I'm pretty desperate because I can't figure out how to achieve what I stated in the question. I've already read countless similar examples but didn't find one that works in my exact situation. So, let's say I have the following code:
<table><tr>
<td><a href="url-a">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>
<td><a href="url-b">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>
<td><a href="url-c">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>
</tr></table>
Now, what I already have is a part of url-a. I basically want to know how I can get id A and img A. I'm trying to "find" the line with XPath but I can't work out a way to make it work. Also, it might be possible that the information is not present at all. This is my latest try (seriously, I've tinkered with this for more than 3 hours now trying numerous different ways):
if (htmlDoc.DocumentNode.SelectSingleNode(@"/a[contains(@href, 'part-url-a')]") != null)
string ida = htmlDoc.DocumentNode.SelectSingleNode(@"/a[contains(@href, 'part-url-a')]/following-sibling::a").InnerText;
Well, it's apparently wrong as hell, so I'd be very pleased if someone could help me out here. I'd also appreciate it if someone could point me to a website that explains XPath notation and syntax in detail, with examples like this one. Books are also welcome.
PS: I know I could achieve my goal without XPath, using Regex or just a simple StreamReader in C# and checking whether each line contains what I need, but a) that's too fragile for my needs because the code might have abrupt line breaks and b) I really want to stay consistent and stick completely to XPath for everything I'm doing in this project.
Thanks in advance for your help!
Use the following XPath expression:
/*/tr/td[a[@href='url-a']]
/following-sibling::td[1]
/a/text()
When evaluated against the provided (malformed but corrected) XML document:
<table><tr>
<td><a href="url-a">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>
<td><a href="url-b">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>
<td><a href="url-c">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>
</tr></table>
the wanted text node is selected:
id A
Similarly, this XPath expression:
/*/tr/td[a[@href='url-a']]
/following-sibling::td[2]
/a/text()
when evaluated against the same XML document (above), selects the other wanted text node:
img A
XSLT-based verification:
When this transformation is applied on the XML document (above):
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/tr/td[a[@href='url-a']]
        /following-sibling::td[1]
          /a/text()"/>
  <xsl:text>&#xA;</xsl:text>
  <xsl:copy-of select=
   "/*/tr/td[a[@href='url-a']]
        /following-sibling::td[2]
          /a/text()"/>
 </xsl:template>
</xsl:stylesheet>
the wanted results are produced:
id A
img A
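Since the question is about C#, here is a minimal sketch of how the same two expressions could be evaluated with the built-in System.Xml classes. It assumes the markup is well-formed XML, as in the corrected document above; the class name XPathDemo and the helper FollowingCellText are made up for illustration and are not part of any library.

```csharp
using System;
using System.Xml;

public static class XPathDemo
{
    // The sample markup from the question, as well-formed XML.
    public const string Sample =
        "<table><tr>" +
        "<td><a href=\"url-a\">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>" +
        "<td><a href=\"url-b\">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>" +
        "<td><a href=\"url-c\">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>" +
        "</tr></table>";

    // Hypothetical helper: returns the link text in the offset-th <td> after the
    // cell whose <a> has exactly the given href, or null if nothing matches.
    // The href is pasted into the expression, so it must not contain an apostrophe.
    public static string FollowingCellText(string xml, string href, int offset)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        string xpath = "/*/tr/td[a[@href='" + href + "']]" +
                       "/following-sibling::td[" + offset + "]" +
                       "/a/text()";

        XmlNode node = doc.SelectSingleNode(xpath);
        return node?.Value;
    }

    public static void Main()
    {
        Console.WriteLine(FollowingCellText(Sample, "url-a", 1)); // id A
        Console.WriteLine(FollowingCellText(Sample, "url-a", 2)); // img A

        // A missing link yields null instead of throwing.
        Console.WriteLine(FollowingCellText(Sample, "url-x", 1) ?? "(not found)");
    }
}
```

XmlDocument only accepts well-formed input, so for real-world HTML a tolerant parser is the safer choice; the XPath itself stays the same.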
You have seriously broken HTML with unmatched closing td tags. Please fix them; this markup is an ugly picture.
That said, Html Agility Pack can hopefully handle whatever you throw at it, so here is how to parse what you have and find the id and img values given the href:
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("test.html");

        // Find the link whose href contains the known URL fragment.
        var anchor = doc.DocumentNode.SelectSingleNode("//a[contains(@href, 'url-a')]");
        if (anchor != null)
        {
            // The id link sits in the next <td> after the link's own cell.
            var id = anchor.ParentNode.SelectSingleNode("following-sibling::td/a");
            if (id != null)
            {
                Console.WriteLine(id.InnerHtml);

                // The img link sits in the <td> after the id's cell.
                var img = id.ParentNode.SelectSingleNode("following-sibling::td/a");
                if (img != null)
                {
                    Console.WriteLine(img.InnerHtml);
                }
            }
        }
    }
}
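For what it's worth, the attempt in the question seems to fail for two reasons. A leading /a only matches a root element named a, and here the root is table, so //a is needed. And following-sibling::a looks for another a inside the same td, but each link is alone in its cell, so the path has to step up to the td first. The sketch below shows both points using the built-in System.Xml classes on a well-formed copy of the markup; the names RelativeXPathDemo and FindIdAndImg are hypothetical, and the same relative expressions should carry over to Html Agility Pack nodes.

```csharp
using System;
using System.Xml;

public static class RelativeXPathDemo
{
    // The sample markup from the question, as well-formed XML.
    public const string Sample =
        "<table><tr>" +
        "<td><a href=\"url-a\">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>" +
        "<td><a href=\"url-b\">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>" +
        "<td><a href=\"url-c\">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>" +
        "</tr></table>";

    // Hypothetical helper: returns { id, img } for the first link whose href
    // contains hrefPart. Entries are null when the information is not present.
    public static string[] FindIdAndImg(string xml, string hrefPart)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // "//a" searches the whole document, not just the root element.
        XmlNode anchor = doc.SelectSingleNode("//a[contains(@href, '" + hrefPart + "')]");
        if (anchor == null)
        {
            return new string[] { null, null };
        }

        // ".." steps up to the enclosing <td>; its following siblings hold id and img.
        XmlNode id = anchor.SelectSingleNode("../following-sibling::td[1]/a");
        XmlNode img = anchor.SelectSingleNode("../following-sibling::td[2]/a");
        return new[] { id?.InnerText, img?.InnerText };
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);

        // The expression from the question finds nothing: the root is <table>, not <a>.
        Console.WriteLine(doc.SelectSingleNode("/a[contains(@href, 'url-a')]") == null); // True

        string[] result = FindIdAndImg(Sample, "url-a");
        Console.WriteLine(result[0]); // id A
        Console.WriteLine(result[1]); // img A
    }
}
```

Using the positional predicates td[1] and td[2] makes it explicit which cell is wanted, rather than relying on SelectSingleNode returning the first match.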