Extracting image URLs from HTML in C# using Html Agility Pack and writing them to an XML file

I am new to C# and I really need help with the following problem. I want to extract the URLs of the photos on a web page whose file names follow a specific pattern. For example, I want to extract all the images whose names follow the pattern name_412s.jpg. I use the following code to extract images from HTML, but I do not know how to adapt it.

public void Images()
{
    WebClient x = new WebClient();
    string source = x.DownloadString(@"http://www.google.com");

    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.LoadHtml(source);

    List<string> images = new List<string>();
    foreach (HtmlNode link in document.DocumentNode.SelectNodes("//img"))
    {
        images.Add(link.GetAttributeValue("src", ""));
    }
}

I also need to write the results to an XML file. Can you help me with that as well?

Thank you!

Asked by Cristina Ursu, Oct 23 '22 01:10


1 Answer

To limit the query results, you need to add a condition to your XPath expression. For instance, //img[contains(@src, '_412s.jpg')] will limit the results to only those img elements whose src attribute contains that string.
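One caveat is that contains() matches the text anywhere in the attribute, so a src such as cat_412s.jpg.bak would also be selected. XPath 1.0 has no ends-with() function, but the same effect can be had by comparing the tail of the string with substring(). The following is a minimal sketch that uses only the standard System.Xml.Linq and System.Xml.XPath namespaces on a well-formed sample, not Html Agility Pack itself; the class and method names (SuffixXPathDemo, BuildEndsWithQuery, SelectSrc) are made up for illustration. The same expression should carry over to SelectNodes, though that is not tested here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class SuffixXPathDemo
{
    // Builds an XPath 1.0 expression selecting img elements whose src ends
    // with the given suffix. The tail of @src is cut out with substring()
    // and compared to the suffix, because XPath 1.0 lacks ends-with().
    // Assumption: the suffix contains no single quote.
    public static string BuildEndsWithQuery(string suffix)
    {
        return "//img[substring(@src, string-length(@src) - string-length('"
               + suffix + "') + 1) = '" + suffix + "']";
    }

    // Runs an XPath query against well-formed markup and returns the src
    // attribute of every matching element.
    public static List<string> SelectSrc(string xhtml, string xpath)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.XPathSelectElements(xpath)
                  .Select(e => (string)e.Attribute("src"))
                  .ToList();
    }
}
```

If the image URLs on the page carry query strings (for example ?w=100 after the file name), the looser contains() test is probably the better fit.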

As far as writing out the results to XML, you'll need to create a new XML document and then copy the matching elements into it. Since you won't be able to directly import an HtmlAgilityPack node into an XmlDocument, you'll have to manually copy all the attributes. For instance:

using System.Net;
using System.Xml;
using HtmlAgilityPack;

// ...

public void Images()
{
    string source;
    using (WebClient x = new WebClient())
    {
        source = x.DownloadString(@"http://www.google.com");
    }

    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    // LoadHtml parses an HTML string; Load expects a file path or stream.
    document.LoadHtml(source);

    XmlDocument output = new XmlDocument();
    XmlElement imgElements = output.CreateElement("ImgElements");
    output.AppendChild(imgElements);

    // SelectNodes returns null when nothing matches, so check before iterating.
    HtmlNodeCollection links = document.DocumentNode.SelectNodes("//img[contains(@src, '_412s.jpg')]");
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            // Recreate the element and copy its attributes one by one.
            XmlElement img = output.CreateElement(link.Name);
            foreach (HtmlAttribute a in link.Attributes)
            {
                img.SetAttribute(a.Name, a.Value);
            }
            imgElements.AppendChild(img);
        }
    }
    output.Save(@"C:\test.xml");
}
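
The code above copies whole img elements. If only the URLs are needed, as the question suggests, LINQ to XML may be a shorter route once the src values have been collected into a list. This is a rough sketch under that assumption; the ImageUrlXml class and the Images/Image element names are invented for illustration and can be anything you like.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ImageUrlXml
{
    // Wraps each URL in an <Image src="..."/> element under a single
    // <Images> root. XAttribute escapes characters such as & on save.
    public static XDocument Build(IEnumerable<string> urls)
    {
        return new XDocument(
            new XElement("Images",
                urls.Select(u => new XElement("Image", new XAttribute("src", u)))));
    }
}
```

Calling ImageUrlXml.Build(urls).Save(@"C:\test.xml") would then write the file, where urls is the list of src values gathered in the loop.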

Answered by Steven Doggart, Oct 31 '22 02:10