Parsing HTML to get content using C#

I am writing an application that crawls a group of my web pages. Rather than keep the entire source of each page, I'd like to extract just the content and store it as plain text in a database. The content will be used by other applications and not read by users, so it doesn't need to be perfectly human-readable.

At first I was thinking of using regular expressions, but I have no control over the validity of the web pages, and there is a good chance that no regular expression would reliably give me the content.

If I have the source code within a string, how can I turn that string of source code into just the content in C#?

Mike B asked Jan 10 '10 18:01


3 Answers

It isn't 100% clear what you want, but I'm assuming you want the text minus markup; so:

using System.Net;
using System.Text;
using HtmlAgilityPack;

string html;
// obtain some arbitrary html....
using (var client = new WebClient()) {
    html = client.DownloadString("http://stackoverflow.com/questions/2038104");
}
// use the html agility pack: http://www.codeplex.com/htmlagilitypack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
StringBuilder sb = new StringBuilder();
// SelectNodes returns null (not an empty collection) when nothing matches
var textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null) {
    foreach (HtmlTextNode node in textNodes) {
        sb.AppendLine(node.Text);
    }
}
string final = sb.ToString();
Marc Gravell answered Oct 21 '22 00:10


Please, please do not parse HTML yourself! A standard regex cannot parse HTML reliably: tags nest, attribute values can contain angle brackets, and real pages are often malformed.
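As a small illustration of how a naive tag-stripping regex goes wrong, here is a sketch that uses only the .NET base class library. The class name, the helper name StripTags, and the sample strings are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexPitfallDemo
{
    // The naive "delete anything between < and >" approach.
    public static string StripTags(string html) =>
        Regex.Replace(html, "<.*?>", string.Empty);

    static void Main()
    {
        // A '>' inside a quoted attribute value ends the match too early.
        Console.WriteLine(StripTags("<a title=\"x > y\">link</a>"));

        // A tag split across two lines is not matched at all,
        // because '.' does not match '\n' without RegexOptions.Singleline.
        Console.WriteLine(StripTags("<div\nclass=\"c\">text</div>"));
    }
}
```

In the first case the output still contains the tail of the attribute ( y">link), and in the second the whole opening <div ...> tag survives. Each case can be patched individually, but the list of special cases keeps growing.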

There are tons of free libraries out there. One of the best free ones in the world of .NET is the HTML Agility Pack.

HTML Agility Pack also copes with malformed documents, which is something that a regex or a strict parser such as an XML parser will almost never do.
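A minimal sketch of what that looks like in practice, assuming the HtmlAgilityPack NuGet package is referenced. The class name, the ExtractText helper, and the sample markup (with unclosed <p> and <b> tags) are invented for illustration; skipping text under <script> and <style> is a choice made here, not something the library does for you.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

static class MalformedHtmlDemo
{
    // Hypothetical helper: returns the text content of an HTML string.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerates unclosed or mis-nested tags

        var pieces = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // Skip text that sits directly inside <script> or <style>.
            .Where(n => n.ParentNode.Name != "script" && n.ParentNode.Name != "style")
            // Decode entities such as &amp; and trim surrounding whitespace.
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        return string.Join(" ", pieces);
    }

    static void Main()
    {
        // Sample input with an unclosed <p> and <b>, made up for this example.
        string html = "<html><body><p>First <b>bold paragraph<p>Second &amp; last"
                    + "<script>var x = 1;</script></body></html>";
        Console.WriteLine(ExtractText(html));
    }
}
```

Joining the pieces with a single space keeps words from adjacent elements apart, which is usually enough when the text is consumed by other programs rather than read by people.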

Eilon answered Oct 20 '22 23:10


The function below removes HTML tags, scripts, and CSS styles from an HTML string and converts it to plain text.

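using System.Net;
using System.Text.RegularExpressions;

private string GetPlainTextFromHtml(string htmlString)
{
    string htmlTagPattern = "<.*?>";
    // Drop <script> and <style> blocks together with their contents.
    var regexCss = new Regex("(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)", RegexOptions.Singleline | RegexOptions.IgnoreCase);
    htmlString = regexCss.Replace(htmlString, string.Empty);
    // Strip the remaining tags; Singleline lets a tag span several lines.
    htmlString = Regex.Replace(htmlString, htmlTagPattern, string.Empty, RegexOptions.Singleline);
    // Decode entities such as &amp;, then turn non-breaking spaces into plain spaces
    // so that neighbouring words are not glued together.
    htmlString = WebUtility.HtmlDecode(htmlString).Replace('\u00A0', ' ');
    // Remove lines that contain only whitespace.
    htmlString = Regex.Replace(htmlString, @"^\s+$[\r\n]*", "", RegexOptions.Multiline);

    return htmlString;
}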
alin0509 answered Oct 20 '22 23:10