
Parsing HTML with the HTML Agility Pack and LINQ

I have the following HTML

(..)
<tbody>
 <tr>
  <td class="name"> Test1 </td>
  <td class="data"> Data </td>
  <td class="data2"> Data 2 </td>
 </tr>
 <tr>
  <td class="name"> Test2 </td>
  <td class="data"> Data2 </td>
  <td class="data2"> Data 2 </td>
 </tr>
</tbody>
(..)

The only information I have is the name, i.e. "Test1" or "Test2". How can I get the contents of the "data" and "data2" cells in the row that has a given name?

Currently I'm using:

var data =
    from
        tr in doc.DocumentNode.Descendants("tr")
    from   
        td in tr.ChildNodes.Where(x => x.Attributes["class"].Value == "name")
    where
        td.InnerText == "Test1"
    select tr;

But I get a NullReferenceException ({"Object reference not set to an instance of an object."}) when I try to inspect data.

Timo Willemsen asked Jan 06 '11 15:01


2 Answers

As for your attempt, you have two issues with your code:

  1. ChildNodes also returns whitespace text nodes, which have no class attribute (text nodes cannot have attributes at all). For those nodes x.Attributes["class"] is null, so reading .Value throws.
  2. As James Walford commented, the spaces around the text are significant, so you probably want to trim them before comparing.

With these two corrections, the following works:

var data =
      from tr in doc.DocumentNode.Descendants("tr")
      from td in tr.Descendants("td").Where(x => x.Attributes["class"].Value == "name")
     where td.InnerText.Trim() == "Test1"
    select tr;
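
The query above returns the matching rows; the question asked for the text of the "data" and "data2" cells. A minimal sketch of that last step, assuming the HtmlAgilityPack NuGet package is referenced and that each row has at most one cell per class. The helper name GetCell and the inline sample markup are illustrative and not part of the original code. GetAttributeValue is used so that a cell without a class attribute yields an empty string instead of a null reference:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: trimmed text of the first <td> child of the row
    // whose class equals cssClass, or null if the row has no such cell.
    static string GetCell(HtmlNode row, string cssClass)
    {
        return row.Elements("td")
                  .Where(td => td.GetAttributeValue("class", "") == cssClass)
                  .Select(td => td.InnerText.Trim())
                  .FirstOrDefault();
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Sample markup mirroring the table in the question.
        doc.LoadHtml(
            "<table><tbody>" +
            "<tr><td class=\"name\"> Test1 </td><td class=\"data\"> Data </td><td class=\"data2\"> Data 2 </td></tr>" +
            "<tr><td class=\"name\"> Test2 </td><td class=\"data\"> Data2 </td><td class=\"data2\"> Data 2 </td></tr>" +
            "</tbody></table>");

        // Find the first row whose "name" cell equals the name we have.
        var row = doc.DocumentNode.Descendants("tr")
                     .FirstOrDefault(tr => GetCell(tr, "name") == "Test1");

        if (row != null)
        {
            Console.WriteLine(GetCell(row, "data"));
            Console.WriteLine(GetCell(row, "data2"));
        }
    }
}
```

Elements("td") only looks at element children, so the whitespace text nodes mentioned in point 1 never reach the class check.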
Kobi answered Oct 07 '22 18:10


Here is the XPath way. Everyone seems to have forgotten about the power of XPath and concentrates exclusively on C# XLinq these days :-)

This function gets all data values associated with a name:

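public static IEnumerable<string> GetData(HtmlDocument document, string name)
{
    // SelectNodes returns null (not an empty collection) when nothing matches.
    // Note that contains() is a substring test, so "Test1" also matches "Test10".
    var nodes = document.DocumentNode.SelectNodes(
        "//td[@class='name' and contains(text(), '" + name + "')]/following-sibling::td");
    return nodes == null
        ? Enumerable.Empty<string>()
        : nodes.Select(node => node.InnerText.Trim());
}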

For example, this code will dump all 'Test2' data:

    HtmlDocument doc = new HtmlDocument();
    doc.Load(yourHtml); // yourHtml is a file path here; use doc.LoadHtml(...) for an HTML string

    foreach (string data in GetData(doc, "Test2"))
    {
        Console.WriteLine(data);
    }
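
Because contains() matches substrings, a lookup for "Test1" would also pick up a row named "Test10". If an exact match is wanted, one option is to compare the whitespace-normalized cell text instead. This is a sketch under assumptions: GetDataExact is a hypothetical name, the HtmlAgilityPack package is referenced, and name contains no single quote (XPath 1.0 has no escape for one inside a quoted literal):

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ExactMatch
{
    // Hypothetical variant of GetData: the trimmed text of the "name" cell
    // must equal name exactly, rather than merely contain it.
    public static IEnumerable<string> GetDataExact(HtmlDocument document, string name)
    {
        // normalize-space(.) strips leading/trailing whitespace and collapses
        // internal runs of whitespace before the comparison.
        string xpath = "//td[@class='name' and normalize-space(.)='" + name +
                       "']/following-sibling::td";

        // SelectNodes returns null when no node matches.
        var nodes = document.DocumentNode.SelectNodes(xpath);
        return nodes == null
            ? Enumerable.Empty<string>()
            : nodes.Select(n => n.InnerText.Trim());
    }
}
```

It can be called the same way as GetData, for example GetDataExact(doc, "Test2").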
Simon Mourier answered Oct 07 '22 16:10