Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing dl with HtmlAgilityPack

This is the sample HTML I am trying to parse with Html Agility Pack in ASP.Net (C#).

<div class="content-div">
    <dl>
        <dt>
            <b><a href="1.html" title="1">1</a></b>
        </dt>
        <dd> First Entry</dd>
        <dt>
            <b><a href="2.html" title="2">2</a></b>
        </dt>
        <dd> Second Entry</dd>
        <dt>
            <b><a href="3.html" title="3">3</a></b>
        </dt>
        <dd> Third Entry</dd>
    </dl>
</div>

The Values I want are :

  • The hyperlink -> 1.html
  • The Anchor Text ->1
  • Inner Text od dd -> First Entry

(I have taken examples of the first entry here but I want the values for these elements for all the entries in the list )

This is the code I am using currently,

var webGet = new HtmlWeb();
            var document = webGet.Load(url2);
var parsedValues=
   from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
   from content in info.SelectNodes("dl//dd")
   from link in info.SelectNodes("dl//dt/b/a")
       .Where(x => x.Attributes.Contains("href"))
   select new 
   {
       Text = content.InnerText,
       Url = link.Attributes["href"].Value,
       AnchorText = link.InnerText,
   };

GridView1.DataSource = parsedValues;
GridView1.DataBind();

The problem is that I get the values for the link and the anchor text correctly but for the inner text of it just takes the value of the first entry and fills the same value for all other entries for the total number of times the element occurs and then it starts over with the second one. I may not be so clear in my explanation so here's a sample output I am getting with this code:

First Entry     1.html  1
First Entry     2.html  2
First Entry     3.html  3
Second Entry    1.html  1
Second Entry    2.html  2
Second Entry    3.html  3
Third Entry     1.html  1
Third Entry     2.html  2
Third Entry     3.html  3

Whereas I am trying to get

First Entry      1.html     1
Second Entry     2.html     2
Third Entry      3.html     3

I am pretty new to HAP and have very little knoweledge on xpath, so I am sure I am doing something wrong here, but I couldn't make it work even after spending hours on it. Any help would be much appreciated.

like image 469
redGREENblue Avatar asked Jan 20 '12 14:01

redGREENblue


1 Answers

Solution 1

I have defined a function that given a dt node will return the next dd node after it:

private static HtmlNode GetNextDDSibling(HtmlNode dtElement)
{
    var currentNode = dtElement;

    while (currentNode != null)
    {
        currentNode = currentNode.NextSibling;

        if(currentNode.NodeType == HtmlNodeType.Element && currentNode.Name =="dd")
            return currentNode;
    }

    return null;
}

and now the LINQ code can be transformed to:

var parsedValues =
    from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
    from dtElement in info.SelectNodes("dl/dt")
    let link = dtElement.SelectSingleNode("b/a[@href]")
    let ddElement = GetNextDDSibling(dtElement)
    where link != null && ddElement != null
    select new
    {
        Text = ddElement.InnerHtml,
        Url = link.GetAttributeValue("href", ""),
        AnchorText = link.InnerText
    };

Solution 2

Without additional functions:

var infoNode = 
        document.DocumentNode.SelectSingleNode("//div[@class='content-div']");

var dts = infoNode.SelectNodes("dl/dt");
var dds = infoNode.SelectNodes("dl/dd");

var parsedValues = dts.Zip(dds,
    (dt, dd) => new
    {
        Text = dd.InnerHtml,
        Url = dt.SelectSingleNode("b/a[@href]").GetAttributeValue("href", ""),
        AnchorText = dt.SelectSingleNode("b/a[@href]").InnerText
    });
like image 130
Cristian Lupascu Avatar answered Oct 17 '22 19:10

Cristian Lupascu