Trying to parse an HTML document and extract some elements (any links to text files).
The current strategy is to load an HTML document into a string. Then find all instances of links to text files. It could be any file type, but for this question, it's a text file.
The end goal is to have an IEnumerable list of string objects. That part is easy, but parsing the data is the question.
<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href="http://myServer.com/blah.txt"></div>
<span>Here is your second text file: <a href="http://myServer.com/blarg2.txt"></span>
<div>Here is your third text file: <a href="http://myServer.com/bat.txt"></div>
<div>Here is your fourth text file: <a href="http://myServer.com/somefile.txt"></div>
<div>Thanks for visiting!</div>
</body>
</html>
The initial approaches are:
A regex that looks for text starting with href= and ending with .txt
The question being:
Here's a C# console app using the regex suggested by Jeff. It reads the string fine, and skips any href that does not end with .txt. With the given sample, it correctly does NOT include the .txt.snarg file in the results (see the HTML string function).
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace ParsePageLinks
{
    class Program
    {
        static void Main(string[] args)
        {
            // Print each captured link so the result is visible.
            foreach (string link in GetAllLinksFromStringByRegex())
            {
                Console.WriteLine(link);
            }
        }

        static List<string> GetAllLinksFromStringByRegex()
        {
            string myHtmlString = BuildHtmlString();
            // Capture the value of any double-quoted href that ends in .txt.
            string txtFileExp = "href=\"([^\\\"]*\\.txt)\"";
            List<string> foundTextFiles = new List<string>();
            MatchCollection textFileLinkMatches = Regex.Matches(myHtmlString, txtFileExp, RegexOptions.IgnoreCase);
            foreach (Match m in textFileLinkMatches)
            {
                foundTextFiles.Add(m.Groups[1].ToString()); // this is your captured group
            }
            return foundTextFiles;
        }

        static string BuildHtmlString()
        {
            return new StringReader(@"<html><head><title>Blah</title></head><body><br/>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt.snarg""></div>
<div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
<div>Thanks for visiting!</div></body></html>").ReadToEnd();
        }
    }
}
Admittedly, a regular expression is not the first choice for parsing HTML correctly: real-world markup often has missing closing tags, mismatched tags, and similar problems that a regex handles poorly.
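To make that caveat a bit more concrete, here is a minimal sketch of one way the pattern from the console app above can miss links: it only accepts double-quoted values with no whitespace around the equals sign. The class name HrefQuoteDemo, the Extract helper, the sample URLs, and the "tolerant" pattern are all assumptions made up for this illustration, not something from the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefQuoteDemo
{
    // Pattern from the console app above: double-quoted values only.
    const string StrictPattern = "href=\"([^\\\"]*\\.txt)\"";

    // Assumed, more tolerant variant: optional whitespace around '=' and
    // either quote style (group 1 = the quote character, group 2 = the URL).
    const string TolerantPattern = @"href\s*=\s*([""'])([^""']*\.txt)\1";

    // Hypothetical helper: returns the requested capture group of every match.
    public static List<string> Extract(string html, string pattern, int group)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            urls.Add(m.Groups[group].Value);
        }
        return urls;
    }

    static void Main()
    {
        // Made-up input: one single-quoted href, one with spaces around '='.
        string html = "<a href='http://myServer.com/single.txt'>one</a>" +
                      "<a href = \"http://myServer.com/double.txt\">two</a>";

        Console.WriteLine(Extract(html, StrictPattern, 1).Count);   // links found by the strict pattern
        Console.WriteLine(Extract(html, TolerantPattern, 2).Count); // links found by the tolerant pattern
    }
}
```

Even the tolerant variant is still only a pattern over text; unquoted attribute values, for instance, would slip past it, which is the usual argument for a real parser.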
What can you do with regular expressions? They can match HTML tags and extract data from HTML documents. HTML is ultimately just text, and what makes regular expressions powerful is that one pattern can match many different strings.
There is also an extension that integrates scripting into HTML parsing for both C# and JavaScript, based on Jint. This means you can parse HTML documents after JavaScript has modified them, whether the script is included in the page or is one you add yourself.
Anyone who wants to extract web data is well advised to learn regular expressions, because they can save a lot of time. For example, this pattern matches a paired tag or a self-closing tag: <(\S*?)[^>]*>.*?</\1>|<.*?/>
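As a rough illustration of the tag-matching pattern quoted above, here is a minimal sketch that runs it over a short snippet. It assumes the pattern is meant to be read without any embedded spaces; the class name TagPatternDemo, the FindTags helper, and the sample input are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagPatternDemo
{
    // Paired tags (<x ...>...</x>) or self-closing tags (<x/>).
    // Assumption: the pattern contains no literal spaces.
    const string TagPattern = @"<(\S*?)[^>]*>.*?</\1>|<.*?/>";

    // Hypothetical helper: returns every substring the pattern matches.
    public static List<string> FindTags(string html)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(html, TagPattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    static void Main()
    {
        // Made-up input with one paired tag and one self-closing tag.
        foreach (string tag in FindTags("<div>Hello</div><br/>"))
        {
            Console.WriteLine(tag);
        }
    }
}
```

Because the body is matched lazily, nested tags with the same name stop at the first closing tag, so this kind of pattern is best kept for quick, flat extractions.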
Neither. Load it into an (X/HT)MLDocument and use XPath, which is a standard method of manipulating XML and very powerful. The functions to look at are SelectNodes and SelectSingleNode.
Since you are apparently using HTML (not XHTML), you should use HTML Agility Pack. Most of the methods and properties match the related XML classes.
Sample implementation using XPath:
HtmlDocument doc = new HtmlDocument();
doc.Load(new StringReader(@"<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt""></div>
<div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
<div>Thanks for visiting!</div>
</body>
</html>"));
HtmlNode root = doc.DocumentNode;
// 3 = ".txt".Length - 1. See http://stackoverflow.com/questions/402211/how-to-use-xpath-function-in-a-xpathexpression-instance-programatically
HtmlNodeCollection links = root.SelectNodes("//a[@href['.txt' = substring(., string-length(.)- 3)]]");
IList<string> fileStrings;
if (links != null)
{
    fileStrings = new List<string>(links.Count);
    foreach (HtmlNode link in links)
    {
        fileStrings.Add(link.GetAttributeValue("href", null));
    }
}
else
{
    fileStrings = new List<string>(0);
}
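If the input happens to be well-formed XHTML, the same XPath expression should also work with the built-in XmlDocument and its SelectNodes method, with no extra library. The following is only a sketch under that assumption: the class name XhtmlLinkDemo, the GetTextFileLinks helper, and the sample markup (with closed anchor tags and no default XML namespace) are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XhtmlLinkDemo
{
    // Hypothetical helper: returns href values that end in ".txt".
    public static List<string> GetTextFileLinks(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        var result = new List<string>();
        // Same "ends with .txt" test as the sample above (XPath 1.0 has no ends-with).
        XmlNodeList links = doc.SelectNodes(
            "//a[@href['.txt' = substring(., string-length(.) - 3)]]");
        foreach (XmlNode link in links)
        {
            // The predicate guarantees the href attribute exists.
            result.Add(link.Attributes["href"].Value);
        }
        return result;
    }

    static void Main()
    {
        // Made-up, well-formed sample: every <a> is closed, no xmlns declared.
        string xhtml =
            "<html><body>" +
            "<div><a href=\"http://myServer.com/blah.txt\">first</a></div>" +
            "<div><a href=\"http://myServer.com/bat.txt.snarg\">other</a></div>" +
            "</body></html>";

        foreach (string href in GetTextFileLinks(xhtml))
        {
            Console.WriteLine(href);
        }
    }
}
```

The question's own sample leaves its anchor tags unclosed, so LoadXml would reject it; that is the case where HTML Agility Pack is the better fit.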