The definitive PHP url parser

Before you tell me to use parse_url: it's not nearly good enough and has too many bugs. There are many questions about parsing URLs to be found on here, but nearly all of them handle only some specific class of URLs or are otherwise incomplete.

I'm looking for a definitive RFC-compliant URL parser in PHP that will reliably process any URL that a browser is likely to encounter. In this I include:

  • Page-internal links #, #title
  • Page-relative URLs blah/thing.php
  • Site-relative URLs /blah/thing.php
  • Anonymous-protocol URLs //ajax.googleapis.com/ajax/libs/jquery/1.8.1/jquery.min.js
  • Callto URLs callto:+442079460123
  • File URLs file:///Users/me/thisfile.txt
  • Mailto URLs mailto:[email protected]?subject=hello, mailto:?subject=hello

It should also support all the usual scheme/authentication/domain/path/query/fragment elements and break them out into an array, with extra flags for relative/schemeless URLs. Ideally it would come with a URL reconstructor (like http_build_url) supporting the same elements. I'd also like validation to be applied, i.e. it should make a best-guess interpretation of an invalid URL but flag it as invalid, just like browsers do.

This answer contained a tantalising Fermat-style reference to such a beast, but it doesn't actually go anywhere.

I've looked in all the major frameworks, but they only seem to provide thin wrappers around parse_url, which is a bad place to start since it makes so many mistakes.

So, does such a thing exist?

Synchro asked Oct 02 '12 09:10


1 Answer

Not sure how many bugs parse_url() has, but this might help:

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

Source: https://www.rfc-editor.org/rfc/rfc3986#page-51

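It breaks down the location as:

$2 - scheme
$4 - authority (userinfo, host and port)
$5 - path
$7 - query string ($6 is the same with its leading "?")
$9 - fragment ($8 is the same with its leading "#")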
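
As a rough illustration, the expression can be dropped into preg_match more or less as-is. This is only a sketch: splitUri() and its array keys (including the 'relative' and 'network_path' flags) are names made up for this example, and it assumes PHP 7.2+ for PREG_UNMATCHED_AS_NULL. It reads the query and fragment from groups 7 and 9, which exclude the leading "?" and "#". It splits a URI reference but does not validate it or break the authority into user, host and port.

```php
<?php
// Sketch only: splitUri() and its array keys are hypothetical names for this
// example, not part of PHP or any library.
function splitUri(string $uri): ?array
{
    // The RFC 3986 Appendix B expression, with "~" as the PCRE delimiter.
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';

    // PREG_UNMATCHED_AS_NULL (PHP 7.2+) separates "absent" (null) from "empty" ('').
    if (preg_match($pattern, $uri, $m, PREG_UNMATCHED_AS_NULL) !== 1) {
        return null;
    }

    $scheme    = $m[2] ?? null;
    $authority = $m[4] ?? null;

    return [
        'scheme'       => $scheme,
        'authority'    => $authority,     // userinfo@host:port, not split further
        'path'         => $m[5] ?? '',
        'query'        => $m[7] ?? null,  // without the leading "?"
        'fragment'     => $m[9] ?? null,  // without the leading "#"
        'relative'     => $scheme === null,
        'network_path' => $scheme === null && $authority !== null,
    ];
}
```

With this, '#title' comes back with only a fragment, 'callto:+442079460123' with a scheme and a path, and 'file:///Users/me/thisfile.txt' with an empty (but present) authority. Note that a page-relative path containing a colon in its first segment would be read as having a scheme, which is how the RFC expression behaves.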

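To rebuild, you could concatenate the groups that keep their delimiters (the ":" after the scheme, the "//", the "?" and the "#"):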

$1 . $3 . $5 . $6 . $8
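
A minimal sketch of that rebuild in PHP, swapping in a new query string along the way. The function name withQuery() and the example.com URL are assumptions made for illustration only. Groups 1, 3 and 8 are used because they still carry their delimiters, so concatenating them needs no extra logic.

```php
<?php
// Sketch only: withQuery() is a hypothetical helper, not an existing PHP function.
function withQuery(string $uri, ?string $query): string
{
    // The RFC 3986 Appendix B expression, with "~" as the PCRE delimiter.
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';
    if (preg_match($pattern, $uri, $m) !== 1) {
        return $uri; // leave the input untouched if PCRE reports an error
    }

    // Trailing groups that did not match are absent from $m, hence the "??".
    $scheme    = $m[1] ?? '';   // e.g. "http:"
    $authority = $m[3] ?? '';   // e.g. "//example.com"
    $path      = $m[5] ?? '';
    $fragment  = $m[8] ?? '';   // e.g. "#top"

    // Passing null drops the query entirely.
    $queryPart = $query === null ? '' : '?' . $query;

    return $scheme . $authority . $path . $queryPart . $fragment;
}

echo withQuery('http://example.com/a?x=1#top', 'y=2'), "\n";
```

This prints http://example.com/a?y=2#top. Relative references survive unchanged apart from the query, since an unmatched scheme or authority simply contributes an empty string.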
Ja͢ck answered Sep 28 '22 02:09