Here are a few URLs:
http://sub.example.com/?feed=atom&hello=world
http://www.sub.example.com/?feed=atom&hello=world
http://sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom#123
As you can see, they all lead to the exact same page but the URL format is different. Here is two other basic examples:
http://example.com/hello/
http://example.com/hello
Both are the same.
I want to convert the URL into one standard format so that when I store the URL in the database, I can easily check whether if the URL string already exists in the database.
Because of the various ways of how the URL can be formatted, this can be puzzling.
What's the definitive approach to converting URL into one standard format? Maybe parse_url()
route...?
As outlined in the comments, there is no definitive solution to this, but the aim is to get as close as possible with what we have without "retrieving" the page. Please read comments before posting an answer to this bounty.
Most web browsers display the URL of a web page above the page in an address bar. A typical URL could have the form http://www.example.com/index.html , which indicates a protocol ( http ), a hostname ( www.example.com ), and a file name ( index. html ).
Select the text that you want to turn into a hyperlink, and right-click it. On the shortcut menu, click Hyperlink. In the Insert Hyperlink dialog, paste the link in the Address box and click OK.
The <a> HTML tag is used to define hyperlinks: <a href=”URL”>link text/a>. If you have a text list of URLs, you can use this online hyperlink generator free to convert them to HTML hyperlinks automatically. This tool will convert the list of web addresses to create a hyperlink.
A URL consists of five parts: the scheme, subdomain, top-level domain, second-level domain, and subdirectory. Below is an illustration of the different parts of a URL.
After you parse_url
:
www
prefix from the domain name
Combine these parts in order to get a canonical URL.
I had the same issue for a reports-configuration-save functionality. In our system, users can design his own reports of sales (like JQL of Jira); for that, we use get params as conditions, and fragment identifier (after #) as layout setup, like this:
http://example.com/report.php?since=20180101&until=20180806#sort=amount&color=blue
For our system, order of GET or after # params are irrelevant as well you reach the same report configuration if set param "until" first than "since", so for us are the same request.
Considering this, subdomains are out of discussion, cause you must solve this using rewrite techniques (like mod_rewrite with 301 in Apache) or create a pool of domain exceptions to do this at software level. Also, different domains can point into different websites, so you must decide if is a good idea; in subdos "www" is very easy to figured it out, but it will toke you time in another cases.
Server side can help to get vars in query section. For example, in PHP you can use function parse_str and $_SERVER['QUERY_STRING'] to get array, and then, you will need use asort() to order it to finnaly compare if are the same request (array_diff function).
Unfortunately, server side is not an option since have no capability to get after hash (#) content, and we still without consider another problems, like scriptname included, protocols or ports:
http://www.sub.example.com/index.php?hello=world&feed=atom
https://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com:8081/?hello=world&feed=atom
In my personal experience, the most close solution is JavaScript, for handling url, parsing query section as array, compare them and do the same with fragment identifier. If you need to use it in server side, every load page will must be followed with an ajax request sending this data to the server.
Apologies in advance for length of my answer, but it is what I had to go through in order to solve the same problems you have. Greetings!
Get protocol, domain, and port from URL Get protocol, domain, and port from URL
How can I get query string values in JavaScript? How can I get query string values in JavaScript?
How do I get the fragment identifier (value after hash #) from a URL? How do I get the fragment identifier (value after hash #) from a URL?
adding the preferred <link rel="canonical" ... >
tag into the HTML headers is the only reliable solution, in order to reference unique content to a single SEF URL
. see Google's documentation, concerning Consolidate duplicate URLs, which possibly answers the whole question more autoritative and reliable, than I ever could.
the idea of being able to know of the canonical URL or to resolve a bunch externals URLs, without parsing those server's .htaccess
rewrite-rules or the HTML headers, does not appear to be applicable (simply because one can maintain a table with URL aliases, which subsequently do not permit guessing how a HTTP request might have been re-written).
this question might belong to https://webmasters.stackexchange.com/search?q=cannonical.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With