
Convert URL into one standard format

Tags: url, php

Here are a few URLs:

http://sub.example.com/?feed=atom&hello=world
http://www.sub.example.com/?feed=atom&hello=world
http://sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom#123

As you can see, they all lead to the exact same page, but the URL format is different. Here are two other basic examples:

http://example.com/hello/
http://example.com/hello

Both are the same.

I want to convert each URL into one standard format so that when I store it in the database, I can easily check whether the URL string already exists there.

Because a URL can be formatted in so many ways, this can be puzzling.

What's the definitive approach to converting a URL into one standard format? Maybe the parse_url() route...?

Edit

As outlined in the comments, there is no definitive solution to this, but the aim is to get as close as possible with what we have without "retrieving" the page. Please read comments before posting an answer to this bounty.

Henrik Petterson asked Aug 04 '18 13:08



3 Answers

After you split the URL into its parts with parse_url():

  1. Remove the www prefix from the domain name
  2. If the path is not empty - remove the trailing slash from it
  3. Sort query parameters alphabetically by their name - if there are any

Combine these parts in order to get a canonical URL.
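
The three steps above can be sketched roughly as follows. This is a minimal sketch rather than a definitive implementation: the function name canonicalizeUrl() is hypothetical, and lowercasing the scheme and host and dropping the fragment are extra assumptions (the latter based on the "#123" example in the question), not part of the steps listed in this answer.

```php
<?php
// Hypothetical helper; the name is illustrative and not from the answer.
function canonicalizeUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // not an absolute URL this sketch can normalize
    }

    // Assumption: scheme and host are compared case-insensitively.
    $scheme = strtolower($parts['scheme']);
    $host   = strtolower($parts['host']);

    // Step 1: remove the "www." prefix from the domain name.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }

    // Step 2: remove the trailing slash from the path ("/" becomes "").
    $path = rtrim($parts['path'] ?? '', '/');

    // Step 3: sort query parameters alphabetically by name, if there are any.
    $query = '';
    if (isset($parts['query']) && $parts['query'] !== '') {
        parse_str($parts['query'], $params);
        ksort($params);
        $query = '?' . http_build_query($params);
    }

    // Keep an explicit port; the fragment is dropped (assumption, see above).
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    return $scheme . '://' . $host . $port . $path . $query;
}

// Usage: both spellings from the question map to the same string.
var_dump(
    canonicalizeUrl('http://www.sub.example.com/?hello=world&feed=atom#123')
    === canonicalizeUrl('http://sub.example.com/?feed=atom&hello=world')
);
```

Because parse_str() and http_build_query() decode and re-encode the query, the stored string may differ textually from the input even when no parameter changed, so every URL should go through the same function before it is compared or stored.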

IVO GELOV answered Oct 23 '22 22:10


I had the same issue with a feature for saving report configurations. In our system, users can design their own sales reports (much like JQL in Jira); for that, we use GET parameters as conditions and the fragment identifier (after #) as layout setup, like this:

http://example.com/report.php?since=20180101&until=20180806#sort=amount&color=blue

For our system, the order of the GET parameters, or of the parameters after #, is irrelevant: you reach the same report configuration whether "until" comes before "since" or after it, so for us they are the same request.

Considering this, subdomains are out of the discussion, because you must handle them either with rewrite techniques (like mod_rewrite with a 301 redirect in Apache) or by maintaining a pool of domain exceptions at the software level. Also, different domains can point to different websites, so you must decide whether treating them as equal is a good idea; among subdomains, "www" is very easy to figure out, but other cases will take you time.

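The server side can help to get the variables in the query section. For example, in PHP you can use parse_str() on $_SERVER['QUERY_STRING'] to get an array, sort it by parameter name with ksort() (asort() would sort by value instead), and finally compare the arrays to see whether they are the same request (array_diff_assoc() checks keys as well as values, whereas array_diff() only checks values).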
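
A minimal sketch of that comparison, assuming the two query strings are already available as plain strings. The function name sameQuery() is hypothetical, and a strict comparison after ksort() is used here in place of an array-diff function.

```php
<?php
// Hypothetical helper; the name is illustrative and not from the answer.
function sameQuery(string $queryA, string $queryB): bool
{
    // parse_str() fills the second argument with name => value pairs.
    parse_str($queryA, $a);
    parse_str($queryB, $b);

    // Sort both arrays by parameter name so the original order no longer matters.
    ksort($a);
    ksort($b);

    // Strict comparison: same keys, same values, same order after sorting.
    return $a === $b;
}

// Usage: the two parameters from the report URL, in either order.
var_dump(sameQuery('since=20180101&until=20180806', 'until=20180806&since=20180101'));
```

Only the top-level parameter names are sorted; nested values such as `a[]=1&a[]=2` keep their original order in this sketch.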

Unfortunately, the server side alone is not an option, because the content after the hash (#) is never sent to the server, and we still have not considered other problems, such as an included script name, different protocols, or ports:

http://www.sub.example.com/index.php?hello=world&feed=atom
https://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com:8081/?hello=world&feed=atom
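
When a URL is available as a stored string rather than as an incoming request, parse_url() does expose the scheme, port, path, and fragment, so these differences can at least be inspected. The sketch below only illustrates that; describeUrl() is a hypothetical name, and whether http and https, /index.php and /, or port 8081 and the default port count as "the same page" remains a per-site decision that code cannot make on its own.

```php
<?php
// Hypothetical helper; the name is illustrative and not from the answer.
function describeUrl(string $url): array
{
    // parse_url() returns false for seriously malformed URLs; fall back to [].
    $parts = parse_url($url) ?: [];

    return [
        'scheme'   => $parts['scheme'] ?? null,   // "http" or "https"
        'port'     => $parts['port'] ?? null,     // integer, or null when absent
        'path'     => $parts['path'] ?? '',       // may include a script name
        'fragment' => $parts['fragment'] ?? null, // text after "#", if any
    ];
}

// Usage with the URLs shown above, plus the report URL with a fragment.
$urls = [
    'http://www.sub.example.com/index.php?hello=world&feed=atom',
    'https://www.sub.example.com/?hello=world&feed=atom',
    'http://www.sub.example.com:8081/?hello=world&feed=atom',
    'http://example.com/report.php?since=20180101&until=20180806#sort=amount&color=blue',
];
foreach ($urls as $url) {
    print_r(describeUrl($url));
}
```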

In my personal experience, the closest solution is JavaScript: handle the URL, parse the query section into an array, compare, and do the same with the fragment identifier. If you need the result on the server side, every page load must be followed by an AJAX request that sends this data to the server.

Apologies in advance for the length of my answer, but it is what I had to go through in order to solve the same problems you have. Greetings!

Get protocol, domain, and port from URL

How can I get query string values in JavaScript?

How do I get the fragment identifier (value after hash #) from a URL?

Benjamin answered Oct 23 '22 20:10


Adding the preferred <link rel="canonical" ... > tag to the HTML head is the only reliable solution for tying unique content to a single SEF URL. See Google's documentation on consolidating duplicate URLs, which probably answers the whole question more authoritatively and reliably than I ever could.
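
For cases where fetching the page is acceptable (which the question itself rules out), reading that tag could look roughly like this. It is a sketch under assumptions: findCanonicalLink() is a hypothetical name, the HTML is assumed to be already downloaded into a string, and only an exact rel="canonical" attribute value is matched.

```php
<?php
// Hypothetical helper; the name is illustrative and not from the answer.
function findCanonicalLink(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }

    // Collect parser warnings internally instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    // Look for the first <link rel="canonical" href="..."> element.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//link[@rel="canonical"][@href]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return $nodes->item(0)->getAttribute('href');
}

// Usage with an inline document.
$html = '<html><head><link rel="canonical" href="http://example.com/hello"></head><body></body></html>';
var_dump(findCanonicalLink($html));
```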

The idea of knowing the canonical URL, or of resolving a bunch of external URLs, without parsing those servers' .htaccess rewrite rules or the HTML head does not appear to be workable, simply because a site can maintain a table of URL aliases, which makes it impossible to guess how an HTTP request might have been rewritten.

This question might belong on https://webmasters.stackexchange.com/search?q=cannonical.

Martin Zeitler answered Oct 23 '22 21:10