Convert URL into one standard format

Tags:

url

php

Here are a few URLs:

http://sub.example.com/?feed=atom&hello=world
http://www.sub.example.com/?feed=atom&hello=world
http://sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom#123

As you can see, they all lead to the exact same page, but the URL format is different. Here are two other basic examples:

http://example.com/hello/
http://example.com/hello

Both are the same.

I want to convert the URL into one standard format so that when I store the URL in the database, I can easily check whether the URL string already exists in the database.

Because a URL can be formatted in so many ways, this can be puzzling.

What's the definitive approach to converting a URL into one standard format? Maybe the parse_url() route?

Edit

As outlined in the comments, there is no definitive solution to this, but the aim is to get as close as possible with what we have without "retrieving" the page. Please read comments before posting an answer to this bounty.

970

asked Aug 04 '18 13:08

Henrik Petterson

3 Answers

After you parse_url:

Remove the www prefix from the domain name
If the path is not empty - remove the trailing slash from it
Sort query parameters alphabetically by their name - if there are any

Combine these parts in order to get a canonical URL.
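The three steps above can be sketched in PHP roughly as follows. This is a minimal sketch, not a definitive implementation: `canonicalizeUrl()` is a hypothetical helper name, and lowercasing the host, defaulting the scheme to `http`, and dropping the fragment are assumptions that go beyond the list. Note also that `parse_str()` rewrites dots and spaces in parameter names to underscores, so unusual parameter names may not round-trip.

```php
<?php
// Hypothetical helper illustrating the steps listed above.
function canonicalizeUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // malformed or relative URL: nothing to canonicalize
    }

    // Step 1: remove a leading "www." (assumption: lowercase the host first).
    $host = strtolower($parts['host']);
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }

    // Step 2: remove the trailing slash from the path.
    $path = rtrim($parts['path'] ?? '', '/');

    // Step 3: sort query parameters alphabetically by name.
    $query = '';
    if (isset($parts['query']) && $parts['query'] !== '') {
        parse_str($parts['query'], $params);
        ksort($params);
        $query = '?' . http_build_query($params, '', '&');
    }

    // Assumptions: default to "http" when no scheme is given, keep an
    // explicit port as-is, and drop the fragment entirely.
    $scheme = strtolower($parts['scheme'] ?? 'http');
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    return $scheme . '://' . $host . $port . $path . $query;
}
```

With this sketch, all six URLs from the question map to `http://sub.example.com?feed=atom&hello=world`, and both `http://example.com/hello/` and `http://example.com/hello` map to `http://example.com/hello`. The result could then be stored in a uniquely indexed column to detect duplicates.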

104

answered Oct 23 '22 22:10

IVO GELOV

I had the same issue with a feature for saving report configurations. In our system, users can design their own sales reports (similar to JQL in Jira); for that, we use GET parameters as conditions and the fragment identifier (after #) as the layout setup, like this:

http://example.com/report.php?since=20180101&until=20180806#sort=amount&color=blue

For our system, the order of the GET parameters, and of the parameters after #, is irrelevant: you reach the same report configuration whether "until" or "since" comes first, so for us they are the same request.

Considering this, subdomains are out of the discussion, because you must solve that with rewrite techniques (like mod_rewrite with a 301 in Apache) or keep a pool of domain exceptions to handle it at the software level. Also, different domains can point to different websites, so you must decide whether it is a good idea; among subdomains, "www" is very easy to figure out, but other cases will take you time.

The server side can help with the variables in the query section. For example, in PHP you can use parse_str() on $_SERVER['QUERY_STRING'] to get an array, sort it by parameter name (ksort() sorts by key, whereas asort() sorts by value), and finally compare the arrays to see whether they are the same request (array_diff_assoc() compares keys and values, whereas array_diff() compares values only).

Unfortunately, the server side alone is not an option, since it has no way to get the content after the hash (#), and we still have not considered other problems, such as an included script name, protocols, or ports:
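A minimal sketch of that server-side comparison, assuming two raw query strings are available. `sameQuery()` is a hypothetical helper name, and nested parameters such as `a[b]=1` are not sorted recursively here:

```php
<?php
// Hypothetical helper: true when both query strings carry the same
// parameters with the same values, regardless of their order.
function sameQuery(string $queryA, string $queryB): bool
{
    parse_str($queryA, $a);
    parse_str($queryB, $b);

    // Sort both arrays by parameter name so that order no longer matters.
    ksort($a);
    ksort($b);

    // After sorting, strict comparison checks keys, values, and types.
    return $a === $b;
}

var_dump(sameQuery('since=20180101&until=20180806', 'until=20180806&since=20180101')); // bool(true)
var_dump(sameQuery('feed=atom', 'feed=rss')); // bool(false)
```

In a request handler, the first argument could be `$_SERVER['QUERY_STRING']` and the second a previously stored query string.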

http://www.sub.example.com/index.php?hello=world&feed=atom
https://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com:8081/?hello=world&feed=atom
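To see where these three URLs differ, `parse_url()` can split each one into its components. The snippet below is only an illustration; `describeUrl()` is a hypothetical helper name, and whether a different scheme, port, or script name still means "the same page" remains a decision for your application.

```php
<?php
$urls = [
    'http://www.sub.example.com/index.php?hello=world&feed=atom',
    'https://www.sub.example.com/?hello=world&feed=atom',
    'http://www.sub.example.com:8081/?hello=world&feed=atom',
];

// Hypothetical helper: keep only the components that differ in these examples.
function describeUrl(string $url): array
{
    $p = parse_url($url);
    return [
        'scheme' => $p['scheme'] ?? null,
        'host'   => $p['host'] ?? null,
        'port'   => $p['port'] ?? null, // null when the URL has no explicit port
        'path'   => $p['path'] ?? null,
    ];
}

foreach ($urls as $url) {
    echo json_encode(describeUrl($url), JSON_UNESCAPED_SLASHES), PHP_EOL;
}
```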

In my personal experience, the closest solution is JavaScript: use it to handle the URL, parse the query section as an array, compare, and do the same with the fragment identifier. If you need the result on the server side, every page load must be followed by an Ajax request sending this data to the server.

Apologies in advance for the length of my answer, but it is what I had to go through to solve the same problems you have. Greetings!

Get protocol, domain, and port from URL

How can I get query string values in JavaScript?

How do I get the fragment identifier (value after hash #) from a URL?

answered Oct 23 '22 20:10

Benjamin

Adding the preferred <link rel="canonical" ... > tag to the HTML head is the only reliable solution for tying unique content to a single SEF URL. See Google's documentation on "Consolidate duplicate URLs", which possibly answers the whole question more authoritatively and reliably than I ever could.

The idea of knowing the canonical URL, or of resolving a bunch of external URLs, without parsing those servers' .htaccess rewrite rules or the HTML headers does not appear to be workable (simply because one can maintain a table of URL aliases, which makes it impossible to guess how an HTTP request might have been rewritten).

This question might belong on https://webmasters.stackexchange.com/search?q=cannonical.

answered Oct 23 '22 21:10

Martin Zeitler
