Parse HTML links using C#

Tags: html, c#, .net

Is there a built-in DLL that will give me a list of links from a string? I want to pass in a string of valid HTML and have it parse out all the links. I seem to remember something like this being built into either .NET or an unmanaged library.

I found a couple of open source projects that looked promising, but I thought there was a built-in module. If not, I may have to use one of those. I just didn't want an external dependency at this point if it wasn't necessary.

Shaun Bowe asked Sep 23 '08 18:09


2 Answers

I'm not aware of anything built in, and your question is a little ambiguous about what exactly you're looking for. Do you want the entire anchor tag, or just the URL from the href attribute?

If you have well-formed XHTML, you might be able to get away with using an XmlReader (or an XmlDocument) and an XPath query to find all the anchor tags (<a>), then reading the href attribute for the address. Since well-formed input is unlikely, you're probably better off using a regular expression to pull out what you want.
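For the well-formed case, a minimal sketch of the XPath approach might look like the following. The class and method names (XhtmlLinks, ExtractHrefs) are made up for illustration, and the sketch assumes the markup has no default XML namespace; a document declaring xmlns="http://www.w3.org/1999/xhtml" would need an XmlNamespaceManager and a prefixed query instead.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XhtmlLinks
{
    // Hypothetical helper: returns the href values of all <a> elements in a
    // well-formed, namespace-free XHTML fragment.
    public static List<string> ExtractHrefs(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        var hrefs = new List<string>();
        // Select only anchors that actually carry an href attribute.
        foreach (XmlNode node in doc.SelectNodes("//a[@href]"))
        {
            hrefs.Add(node.Attributes["href"].Value);
        }
        return hrefs;
    }
}
```

Calling XhtmlLinks.ExtractHrefs on ordinary tag-soup HTML will usually throw, which is why this only suits input you know to be valid XML.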

Using RegEx, you could do something like:

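// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
List<Uri> findUris(string message)
{
    // The "href" group captures the attribute value; "fileName" captures the link text.
    string anchorPattern = "<a[\\s]+[^>]*?href[\\s]?=[\\s\\\"\']+(?<href>.*?)[\\\"\\']+.*?>(?<fileName>[^<]+|.*?)?<\\/a>";
    MatchCollection matches = Regex.Matches(message, anchorPattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.Compiled);
    if (matches.Count > 0)
    {
        List<Uri> uris = new List<Uri>();

        foreach (Match m in matches)
        {
            // Read the group the pattern actually defines ("href"); a group named "url" does not exist.
            string url = m.Groups["href"].Value;
            Uri testUri = null;
            // Keep only values that parse as a relative or absolute Uri.
            if (Uri.TryCreate(url, UriKind.RelativeOrAbsolute, out testUri))
            {
                uris.Add(testUri);
            }
        }
        return uris;
    }
    // No anchors found; callers must handle null.
    return null;
}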

Note that the code checks each href with Uri.TryCreate to make sure the address actually makes sense as a valid Uri. You can drop that check if you aren't going to follow the links anywhere.

Jacob Proffitt answered Nov 10 '22 05:11


I don't think there is a built-in library, but the Html Agility Pack is popular for what you want to do.

The way to do this with the raw .NET Framework and no external dependencies would be to use a regular expression to find all the 'a' tags in the string. You would probably need to handle a lot of edge cases, e.g. href = "http://url" vs. href=http://url.
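As a rough sketch of handling those quoting variants, the pattern below accepts double-quoted, single-quoted, and unquoted href values, with optional whitespace around the equals sign. The names AnchorHrefs and Extract are hypothetical, and this is only an illustration: it does not cope with comments, script blocks, or attributes such as data-href, so it is not a substitute for a real HTML parser.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorHrefs
{
    // Matches href="...", href='...' and unquoted href=... inside an <a> tag.
    // .NET allows the same group name ("url") to be reused across alternatives.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>""']+))",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the raw href values in document order.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

The returned strings are unvalidated; passing each one through Uri.TryCreate would filter out values that are not usable addresses.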

Brian Lyttle answered Nov 10 '22 05:11