String to HtmlDocument

Tags: html, c#

I'm fetching an HTML document by URL using WebClient.DownloadString(url), but then it's very hard to find the element content that I'm looking for. While reading around I spotted HtmlDocument, which has neat things like GetElementById. How can I populate an HtmlDocument with the HTML returned by the URL?

lappy asked Feb 08 '11 16:02



3 Answers

The HtmlDocument class (in System.Windows.Forms) is a wrapper around the native IHTMLDocument2 COM interface.
You cannot easily create it from a string.

You should use the Html Agility Pack instead.

SLaks answered Oct 20 '22 23:10


Using Html Agility Pack as suggested by SLaks, this becomes very easy:

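// Requires the HtmlAgilityPack NuGet package and "using HtmlAgilityPack;".
string html = webClient.DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(html);

// Note the lowercase "b" in GetElementbyId.
HtmlNode specificNode = doc.GetElementbyId("nodeId");
// "x/path/nodes" is a placeholder for a real XPath expression.
HtmlNodeCollection nodesMatchingXPath = doc.DocumentNode.SelectNodes("x/path/nodes");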
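
For anyone who wants to try this without a network call, here is a minimal self-contained sketch of the same idea. It assumes the HtmlAgilityPack NuGet package is installed; the inline markup, the id "nodeId" and the "//p" XPath are made up for illustration. One thing worth knowing is that SelectNodes returns null (not an empty collection) when nothing matches, so the sketch checks for that.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class Program
{
    static void Main()
    {
        // Hypothetical markup standing in for the result of DownloadString(url).
        string html = "<html><body><div id='nodeId'>Hello</div><p>one</p><p>two</p></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Look up a single element by its id attribute (the method name has a lowercase 'b').
        HtmlNode byId = doc.GetElementbyId("nodeId");
        if (byId != null)
        {
            Console.WriteLine(byId.InnerText);
        }

        // SelectNodes returns null when the XPath matches nothing, so guard before iterating.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (HtmlNode p in paragraphs)
            {
                Console.WriteLine(p.InnerText);
            }
        }
    }
}
```

In real use you would replace the inline string with the html you downloaded and adjust the id and XPath to the page you are scraping.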
Dan Tao answered Oct 21 '22 01:10


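To answer the original question (this uses the mshtml interop types, so the project needs a reference to Microsoft.mshtml):

HTMLDocument doc = new HTMLDocument();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(fileText);
doc2.close(); // close the output stream once writing is done
// now use doc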

Then to convert back to a string:

string html = doc.documentElement.outerHTML;
David Sherret answered Oct 20 '22 23:10