Regex to match the path before the resource from a URL

Tags: c#, regex, url

Just so everybody understands the vocabulary involved, the general structure of a URL is as follows:

  http   ://   www.a.com   /  path/to/resource.html  ?  query=value  #  fragment
{scheme} ://  {authority}  /         {path}          ?   {query}     #  {fragment}

The {path} component itself consists of a path and a resource: in path/to/resource.html, the path is path/to/ and the resource is resource.html.

Poor, Nasty and Brutish:
HTML, as it is found in the wild, can be poor, nasty and brutish, though quite often far from short. In this poor, nasty and brutish world happen to live links, which in themselves can be poor, nasty and brutish, despite the fact that URLs are supposed to adhere to the standards. So with this in mind, I present you the problem...

Problem:

I'm trying to create a regex to remove the resource from a URL's path, which is necessary when there is a link within a web page that is a relative path. For example:

  1. I visit www.domain.com/path/to/page1.html.
  2. There is a relative link to /page2.html
  3. Remove the /page1.html from the URL
  4. Append /page2.html to www.domain.com/path/to

Result: www.domain.com/path/to/page2.html

I'm stuck on step 3!

I've isolated the path and resource, but now I want to separate the two. The regex I tried to come up with looks like this: \z([^\/]\.[^\/])

In C# the same regex is: "\\z([^/]\\.[^/])"

Translated into English, the regex is supposed to mean: match the end of the string, which includes all characters separated by a dot, as long as those characters are not slashes.

I tried that regular expression, but it currently fails miserably. What is the proper regex to achieve this result?

Here are some sample cases:

/path/to/resource.html => /path/to/ and resource.html
/pa.th/to/resource.html => /pa.th/to/ and resource.html
/path/to/resource.html/ => /path/to/resource.html/
/*I#$>/78zxdc.78&(!~ => /*I#$>/ and 78zxdc.78&(!~

Thanks for your help!

Kiril asked Jan 20 '23 14:01


1 Answer

Use System.Uri, which already splits a URL into its components:

using System;

var uri = new Uri("http://www.domain.com/path/to/page1.html?query=value#fragment");

Console.WriteLine(uri.Scheme); // http
Console.WriteLine(uri.Host); // www.domain.com
Console.WriteLine(uri.AbsolutePath); // /path/to/page1.html
Console.WriteLine(uri.PathAndQuery); // /path/to/page1.html?query=value
Console.WriteLine(uri.Query); // ?query=value
Console.WriteLine(uri.Fragment); // #fragment

// The last segment is the resource.
Console.WriteLine(uri.Segments[uri.Segments.Length - 1]); // page1.html

// Each directory segment keeps its trailing slash.
for (var i = 0; i < uri.Segments.Length; i++)
{
    Console.WriteLine("{0}: {1}", i, uri.Segments[i]);
}
/*
Output
0: /
1: path/
2: to/
3: page1.html
*/
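
The four steps in the question (strip the resource, then append the link) can also be handled without any string surgery. The following is only a sketch, not part of the original answer: the class name RelativeLinkDemo and the helper DirectoryOf are hypothetical, and the link texts "page2.html" and "/page2.html" are assumed inputs. It relies on the Uri(Uri, string) constructor, which resolves a link against a base URI.

```csharp
using System;

public static class RelativeLinkDemo
{
    // Hypothetical helper: the directory part of the path, i.e. everything
    // up to and including the last '/' of AbsolutePath.
    public static string DirectoryOf(Uri uri)
    {
        string path = uri.AbsolutePath;
        return path.Substring(0, path.LastIndexOf('/') + 1);
    }

    public static void Main()
    {
        var page = new Uri("http://www.domain.com/path/to/page1.html");

        // Step 3 on its own: drop the resource from the path.
        Console.WriteLine(DirectoryOf(page)); // /path/to/

        // Steps 3 and 4 together: a link without a leading slash is
        // resolved against the directory of the current page.
        var sibling = new Uri(page, "page2.html");
        Console.WriteLine(sibling.AbsoluteUri); // http://www.domain.com/path/to/page2.html

        // A link with a leading slash is resolved against the site root.
        var rooted = new Uri(page, "/page2.html");
        Console.WriteLine(rooted.AbsoluteUri); // http://www.domain.com/page2.html
    }
}
```

Note the difference between the two links: only the one without a leading slash produces www.domain.com/path/to/page2.html, the result the question is after.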

amit_g answered Feb 02 '23 11:02
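
If a regex is still preferred for splitting an already isolated path, one possible pattern is sketched below. It is an assumption for illustration rather than something taken from the answer above, and the names PathSplitter and Split are hypothetical. The greedy group (.*/) captures everything up to and including the last slash, and ([^/]*) captures whatever follows, which may be empty. The attempted pattern from the question cannot match because \z asserts the end of the input, so no characters remain for the group that comes after it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PathSplitter
{
    // Group 1: everything up to and including the last '/'.
    // Group 2: the remaining characters, none of which is a '/'.
    private static readonly Regex SplitPattern = new Regex(@"\A(.*/)([^/]*)\z");

    public static (string Directory, string Resource) Split(string path)
    {
        Match m = SplitPattern.Match(path);
        if (!m.Success)
        {
            // No slash at all: treat the whole input as the resource.
            return ("", path);
        }
        return (m.Groups[1].Value, m.Groups[2].Value);
    }

    public static void Main()
    {
        // The sample cases listed in the question.
        string[] samples =
        {
            "/path/to/resource.html",
            "/pa.th/to/resource.html",
            "/path/to/resource.html/",
            "/*I#$>/78zxdc.78&(!~",
        };

        foreach (string sample in samples)
        {
            var (directory, resource) = Split(sample);
            Console.WriteLine("{0} => {1} and {2}", sample, directory, resource);
        }
    }
}
```

For the third sample, which ends in a slash, the resource comes back as an empty string and the whole input is returned as the directory, matching the expectation in the question.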