I'm trying to get the host from a URL using parse_url, but for some inputs I get empty results. Here is my function:
    function clean_url($urls){
        $good_url=array();
        for ($i=0;$i<count($urls);$i++){
            $url=parse_url($urls[$i]);
            //$temp_string=str_replace("http://", "", $urls[$i]);
            //$temp_string=str_replace("https://", "", $urls[$i]);
            //$temp_string=substr($temp_string, 0,stripos($temp_string,"/"));
            array_push($good_url, $url['host']);
        }
        return $good_url;
    }
Input array:
Array (
[0] => https://en.wikipedia.org/wiki/Data
[1] => data.gov.ua/
[2] => e-data.gov.ua/
[3] => e-data.gov.ua/transaction
[4] => https://api.jquery.com/data/
[5] => https://api.jquery.com/jquery.data/
[6] => searchdatamanagement.techtarget.com/definition/data
[7] => www.businessdictionary.com/definition/data.html
[8] => https://data.world/
[9] => https://en.oxforddictionaries.com/definition/data
)
Resulting array, with empty entries:
Array (
[0] => en.wikipedia.org
[1] =>
[2] =>
[3] =>
[4] => api.jquery.com
[5] => api.jquery.com
[6] =>
[7] =>
[8] => data.world
[9] => en.oxforddictionaries.com
)
Some of the $urls being parsed have no scheme, which causes parse_url to treat their hosts as paths. For example, parsing the URL data.gov.ua/ returns data.gov.ua/ as the path. Adding a scheme such as https, so that the URL becomes https://data.gov.ua/, allows parse_url to recognise data.gov.ua as the host.
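As a minimal sketch of that idea, the function from the question could prepend a default scheme before parsing. The helper name with_scheme and the choice of https as the default are assumptions made for illustration, not part of the original code:

```php
<?php
// Hypothetical helper: prepend a default scheme when the input has none.
function with_scheme(string $url, string $default = 'https'): string
{
    // Matches a leading scheme such as "http://", "https://" or "ftp://".
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        return $url;
    }
    return $default . '://' . $url;
}

function clean_url(array $urls): array
{
    $hosts = [];
    foreach ($urls as $url) {
        $host = parse_url(with_scheme($url), PHP_URL_HOST);
        // parse_url() gives null for a missing host and false for a
        // seriously malformed URL; store an empty string in both cases.
        $hosts[] = is_string($host) ? $host : '';
    }
    return $hosts;
}

print_r(clean_url([
    'https://en.wikipedia.org/wiki/Data',
    'data.gov.ua/',
    'e-data.gov.ua/transaction',
]));
```

With these three inputs, every entry should now come back with a host instead of an empty value.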
The general format of a URL is:
scheme://hostname:port/path?query#fragment
Each part of the URL is optional, and parse_url uses the delimiters between them to determine which parts have been provided or omitted.
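To see how the delimiters map to components, here is a small sketch that parses a URL containing every part. The URL itself (example.com, port 8080, and so on) is made up purely for illustration:

```php
<?php
// Illustrative URL only; every value in it is invented for this example.
$parts = parse_url('https://example.com:8080/some/path?key=value#section');

// Each delimited piece is returned under its own key:
// scheme, host, port, path, query and fragment.
print_r($parts);
```

Dropping a delimiter drops the corresponding key, which is why a string with no // has no host entry.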
The hostname is the part of the URL after the // prefix. Many of your URLs are missing this prefix, so they don't have a hostname.
For instance, parse_url('data.gov.ua/') returns:
Array
(
[path] => data.gov.ua/
)
To get what you want, it should be parse_url('//data.gov.ua/'):
Array
(
[host] => data.gov.ua
[path] => /
)
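One way to apply this to mixed input is to add the // prefix only when the string does not already start with a scheme or with //. This is a rough sketch; the function name host_of is hypothetical, and it assumes that any such input is a bare "host/path" string:

```php
<?php
// Hypothetical helper: return the host of a URL, or null if none is found.
function host_of(string $url): ?string
{
    // Assumption: input that starts with neither "scheme://" nor "//"
    // is a bare "host/path" string, so it gets a "//" prefix.
    if (!preg_match('#^([a-z][a-z0-9+.-]*:)?//#i', $url)) {
        $url = '//' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    // parse_url() yields null or false when no host can be extracted.
    return is_string($host) ? $host : null;
}

var_dump(host_of('data.gov.ua/'));
var_dump(host_of('https://api.jquery.com/data/'));
```

The assumption does not hold for relative paths such as images/logo.png, which would be misread as having a host, so this only suits lists known to contain hosts.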
This frequently confuses programmers because browsers are very forgiving about incomplete URLs typed into the location field: they use heuristics to decide whether something is a hostname or a path. APIs like parse_url() are stricter.