Parsing HTML document: Regular expression or LINQ?

Tags: c#, regex, parsing, linq, linq-to-xml

I'm trying to parse an HTML document and extract some elements (any links to text files).

The current strategy is to load an HTML document into a string. Then find all instances of links to text files. It could be any file type, but for this question, it's a text file.

The end goal is to have an IEnumerable list of string objects. That part is easy, but parsing the data is the question.

<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href="http://myServer.com/blah.txt"></div>
<span>Here is your second text file: <a href="http://myServer.com/blarg2.txt"></span>
<div>Here is your third text file: <a href="http://myServer.com/bat.txt"></div>
<div>Here is your fourth text file: <a href="http://myServer.com/somefile.txt"></div>
<div>Thanks for visiting!</div>
</body>
</html>

The initial approaches are:

Load the string into an XML document and query it with LINQ to XML.
Create a regex that looks for a string starting with href= and ending with .txt.

My questions are:

What would that regex look like? I am a regex newbie, and this is part of my regex learning.
Which method would you use to extract a list of tags?
Which would be the most performant way?
Which method would be the most readable/maintainable?

Update: Kudos to Matthew on the HTML Agility Pack suggestion. It worked just fine! The XPath suggestion works as well. I wish I could mark both answers as 'The Answer', but I obviously cannot. They are both valid solutions to the problem.

Here's a C# console app using the regex suggested by Jeff. It reads the string fine and skips any href that does not end with .txt. With the sample returned by BuildHtmlString below, it correctly does NOT include the .txt.snarg link in the results.

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace ParsePageLinks
{
    class Program
    {
        static void Main(string[] args)
        {
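            // Print each matched link so the result is visible on the console.
            foreach (string link in GetAllLinksFromStringByRegex())
            {
                Console.WriteLine(link);
            }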
        }

        static List<string> GetAllLinksFromStringByRegex()
        {
            string myHtmlString = BuildHtmlString();
            string txtFileExp = "href=\"([^\\\"]*\\.txt)\"";

            List<string> foundTextFiles = new List<string>();

            MatchCollection textFileLinkMatches = Regex.Matches(myHtmlString, txtFileExp, RegexOptions.IgnoreCase);
            foreach (Match m in textFileLinkMatches)
            {
                foundTextFiles.Add( m.Groups[1].ToString()); // this is your captured group
            }

            return foundTextFiles;
        }

        static string BuildHtmlString()
        {
            return @"<html><head><title>Blah</title></head><body><br/>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt.snarg""></div>
<div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
<div>Thanks for visiting!</div></body></html>";
        }
    }
}
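For readers who want the pattern spelled out, here is a minimal sketch of the same regex written as a verbatim string, with the matches projected into an IEnumerable<string> (the stated end goal). The names TxtLinkRegex, Find and RegexDemo are made up for this illustration. It assumes every href is wrapped in double quotes, as in the sample; single-quoted or unquoted attributes would be missed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TxtLinkRegex
{
    // href="         literal attribute name, equals sign and opening quote
    // ([^"]*\.txt)   group 1: a run of non-quote characters that ends in ".txt"
    // "              the closing quote must follow ".txt" immediately
    const string Pattern = @"href=""([^""]*\.txt)""";

    public static IEnumerable<string> Find(string html)
    {
        return Regex.Matches(html, Pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()                    // works on every framework version
                    .Select(m => m.Groups[1].Value);  // keep only the captured URL
    }
}

static class RegexDemo
{
    static void Main()
    {
        string html = @"<a href=""http://myServer.com/blah.txt"">
<a href=""http://myServer.com/bat.txt.snarg"">";

        foreach (string link in TxtLinkRegex.Find(html))
            Console.WriteLine(link);   // prints http://myServer.com/blah.txt
    }
}
```

Because [^"]* cannot cross a quote and .txt must sit directly before the closing quote, the .txt.snarg link never matches.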

717

asked May 25 '09 17:05

p.campbell

1 Answer

Neither. Load it into an (X/HT)MLDocument and use XPath, which is a standard method of manipulating XML and very powerful. The functions to look at are SelectNodes and SelectSingleNode.

Since you are apparently using HTML (not XHTML), you should use HTML Agility Pack. Most of the methods and properties match the related XML classes.
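To illustrate why the HTML/XHTML distinction matters, here is a rough sketch using only the framework's LINQ to XML classes. The class and method names are invented for the example. It works on a well-formed variant of the markup where every a element is closed, but the XML parser rejects markup shaped like the question's sample, where the a tags are left open.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class LinqToXmlSketch
{
    // Returns href values ending in ".txt"; throws XmlException on malformed input.
    public static List<string> TxtLinks(string xhtml)
    {
        return XDocument.Parse(xhtml)
                        .Descendants("a")
                        .Attributes("href")
                        .Select(attr => attr.Value)
                        .Where(href => href.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
                        .ToList();
    }

    static void Main()
    {
        // Well-formed variant: each <a> has a matching </a>.
        string wellFormed = "<html><body>" +
            "<div><a href=\"http://myServer.com/blah.txt\">first</a></div>" +
            "<div><a href=\"http://myServer.com/bat.txt.snarg\">other</a></div>" +
            "</body></html>";
        foreach (string link in TxtLinks(wellFormed))
            Console.WriteLine(link);

        // Same shape as the question's sample: <a> is never closed.
        try
        {
            TxtLinks("<div>file: <a href=\"http://myServer.com/blah.txt\"></div>");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Not well-formed XML: " + ex.Message);
        }
    }
}
```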

Sample implementation using XPath:

    using System.Collections.Generic;
    using System.IO;
    using HtmlAgilityPack;

    HtmlDocument doc = new HtmlDocument();
    doc.Load(new StringReader(@"<html>
<head><title>Blah</title>
</head>
<body>
<br/>
<div>Here is your first text file: <a href=""http://myServer.com/blah.txt""></div>
<span>Here is your second text file: <a href=""http://myServer.com/blarg2.txt""></span>
<div>Here is your third text file: <a href=""http://myServer.com/bat.txt""></div>
<div>Here is your fourth text file: <a href=""http://myServer.com/somefile.txt""></div>
<div>Thanks for visiting!</div>
</body>
</html>"));
    HtmlNode root = doc.DocumentNode;
    // 3 = ".txt".Length - 1.  See http://stackoverflow.com/questions/402211/how-to-use-xpath-function-in-a-xpathexpression-instance-programatically
    // SelectNodes returns null (not an empty collection) when nothing matches.
    HtmlNodeCollection links = root.SelectNodes("//a[@href['.txt' = substring(., string-length(.) - 3)]]");
    IList<string> fileStrings;
    if (links != null)
    {
        fileStrings = new List<string>(links.Count);
        foreach (HtmlNode link in links)
            fileStrings.Add(link.GetAttributeValue("href", null));
    }
    else
        fileStrings = new List<string>(0);
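Since the question also asks about LINQ, the same extraction can be sketched with LINQ to Objects over the Agility Pack node tree instead of XPath. This assumes a version of HTML Agility Pack that exposes Descendants(string) on HtmlNode; TxtLinkFinder and Find are names made up for this example. Unlike the XPath expression above, this variant compares the extension case-insensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;   // third-party package, not part of the framework

static class TxtLinkFinder
{
    public static List<string> Find(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);   // tolerant parser: unclosed <a> tags are accepted

        return doc.DocumentNode
                  .Descendants("a")
                  // Anchors without an href yield "" and are filtered out below.
                  .Select(a => a.GetAttributeValue("href", string.Empty))
                  .Where(href => href.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
                  .ToList();
    }
}

static class AgilityDemo
{
    static void Main()
    {
        string html = "<div>first: <a href=\"http://myServer.com/blah.txt\"></div>" +
                      "<div>other: <a href=\"http://myServer.com/bat.txt.snarg\"></div>";

        foreach (string link in TxtLinkFinder.Find(html))
            Console.WriteLine(link);
    }
}
```

The query returns an empty list when nothing matches, so no null check is needed.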

137

answered Nov 11 '22 18:11

Matthew Flaschen
