I have problems parsing a URL that doesn't have a path but has a slash in the query. For example: http://example.com?q=a/b
I'm aware that such a URL is most likely invalid (*): it requires at least a slash as the path, like this: http://example.com/?q=a/b
All browsers in which I tried such a URL correct it automatically. And that is basically what I want to reproduce: identify and correct such a URL.
Using parse_url, however, produces:
var_dump( parse_url('http://example.com?q=a/b') );
array(3) {
["scheme"]=>
string(4) "http"
["host"]=>
string(15) "example.com?q=a"
["path"]=>
string(2) "/b"
}
With a URL that has no slash in the query, it works fine:
var_dump( parse_url('http://example.com?q=ab') );
array(3) {
["scheme"]=>
string(4) "http"
["host"]=>
string(11) "example.com"
["query"]=>
string(4) "q=ab"
}
All external libraries I tried (Jwage\Purl, League\Url, Sabre\Uri) basically do the same thing, which surprises me a bit.
Why do (all?) browsers get it "right", while (all?) PHP libraries get it "wrong"?
Other than trying to catch these cases with a regular expression before parsing the URL (which may be unreliable - that's why I want to use a library in the first place), what alternatives do I have?
(*) I consulted three sources (RFC 1738, RFC 3986, and the WHATWG URL Standard), and all three disagree on what is considered valid.
In case you still want to apply a regular expression, the following should generate the URL you are looking for:
$url = preg_replace('~^([^/]+://[^/?#]+)\?~', '$1/?', $url);
It requires the URL to start with a protocol name of at least one character followed by "://" and a domain name of at least one character ("localhost" would be acceptable too). It then inserts a '/' before the first '?', but only if there is no '/' (or '#') between the domain name and that '?'.
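If a hand-written pattern feels too fragile, one alternative is to lean on the generic URI-splitting expression given in RFC 3986, Appendix B, which ends the authority at the first "/", "?" or "#". The following is only a minimal sketch under that assumption: the function name addMissingRootPath is made up for illustration and is not part of PHP or of any of the libraries mentioned above.

```php
<?php
// Hypothetical helper: inserts "/" as the path when a URL has an authority
// (host) but an empty path, e.g. http://example.com?q=a/b.
function addMissingRootPath(string $url): string
{
    // Generic URI regular expression from RFC 3986, Appendix B.
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';
    if (!preg_match($pattern, $url, $m)) {
        return $url; // leave the input untouched if it cannot be split
    }

    // Trailing groups that did not match are absent from $m, hence "?? ''".
    $scheme    = $m[1] ?? ''; // includes the trailing ":"
    $authority = $m[3] ?? ''; // includes the leading "//"
    $path      = $m[5] ?? '';
    $query     = $m[6] ?? ''; // includes the leading "?"
    $fragment  = $m[8] ?? ''; // includes the leading "#"

    // Only URLs with an authority and no path at all are changed.
    if ($authority !== '' && $path === '') {
        $path = '/';
    }

    return $scheme . $authority . $path . $query . $fragment;
}

// The corrected URL can then be handed to parse_url as usual.
$parts = parse_url(addMissingRootPath('http://example.com?q=a/b'));
```

With this sketch, 'http://example.com?q=a/b' becomes 'http://example.com/?q=a/b', and URLs that already have a path are returned unchanged. Note that a bare 'http://example.com' also gains a trailing slash, which may or may not be what you want.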