Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# Convert Relative to Absolute Links in HTML String

I'm mirroring some internal websites for backup purposes. As of right now I basically use this c# code:

System.Net.WebClient client = new System.Net.WebClient();
byte[] dl = client.DownloadData(url);

This just basically downloads the html and into a byte array. This is what I want. The problem however is that the links within the html are most of the time relative, not absolute.

I basically want to append whatever the full http://domain.is before the relative link as to convert it to an absolute link that will redirect to the original content. I'm basically just concerned with href= and src=. Is there a regex expression that will cover some of the basic cases?

Edit [My Attempt]:

public static string RelativeToAbsoluteURLS(string text, string absoluteUrl)
{
    if (String.IsNullOrEmpty(text))
    {
        return text;
    }

    String value = Regex.Replace(
        text, 
        "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>", 
        "<$1$2=\"" + absoluteUrl + "$3\"$4>", 
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    return value.Replace(absoluteUrl + "/", absoluteUrl);
}
like image 692
Gary Avatar asked Oct 01 '10 04:10

Gary


People also ask

What C is used for?

C programming language is a machine-independent programming language that is mainly used to create many types of applications and operating systems such as Windows, and other complicated programs such as the Oracle database, Git, Python interpreter, and games and is considered a programming foundation in the process of ...

What is the full name of C?

In the real sense it has no meaning or full form. It was developed by Dennis Ritchie and Ken Thompson at AT&T bell Lab. First, they used to call it as B language then later they made some improvement into it and renamed it as C and its superscript as C++ which was invented by Dr.

Is C language easy?

C is a general-purpose language that most programmers learn before moving on to more complex languages. From Unix and Windows to Tic Tac Toe and Photoshop, several of the most commonly used applications today have been built on C. It is easy to learn because: A simple syntax with only 32 keywords.

What is C in C language?

What is C? C is a general-purpose programming language created by Dennis Ritchie at the Bell Laboratories in 1972. It is a very popular language, despite being old. C is strongly associated with UNIX, as it was developed to write the UNIX operating system.


1 Answers

The most robust solution would be to use the HTMLAgilityPack as others have suggested. However a reasonable solution using regular expressions is possible using the Replace overload that takes a MatchEvaluator delegate, as follows:

var baseUri = new Uri("http://test.com");
var pattern = @"(?<name>src|href)=""(?<value>/[^""]*)""";
var matchEvaluator = new MatchEvaluator(
    match =>
    {
        var value = match.Groups["value"].Value;
        Uri uri;

        if (Uri.TryCreate(baseUri, value, out uri))
        {
            var name = match.Groups["name"].Value;
            return string.Format("{0}=\"{1}\"", name, uri.AbsoluteUri);
        }

        return null;
    });
var adjustedHtml = Regex.Replace(originalHtml, pattern, matchEvaluator);

The above sample searches for attributes named src and href that contain double quoted values starting with a forward slash. For each match, the static Uri.TryCreate method is used to determine if the value is a valid relative uri.

Note that this solution doesn't handle single quoted attribute values and certainly doesn't work on poorly formed HTML with unquoted values.

like image 129
Nathan Baulch Avatar answered Oct 20 '22 04:10

Nathan Baulch