 

PHP parse_url() returns an empty host for some URLs

I'm trying to get the host from each URL using parse_url(), but for some inputs I get empty results. Here is my function:

function clean_url($urls){
    $good_url=array();
    for ($i=0;$i<count($urls);$i++){
        $url=parse_url($urls[$i]);

       //$temp_string=str_replace("http://", "", $urls[$i]);
       //$temp_string=str_replace("https://", "", $urls[$i]);
       //$temp_string=substr($temp_string, 0,stripos($temp_string,"/"));
       array_push($good_url, $url['host']);
    }
    return $good_url;
}

Input array:

Array ( 
    [0] => https://en.wikipedia.org/wiki/Data 
    [1] => data.gov.ua/ 
    [2] => e-data.gov.ua/ 
    [3] => e-data.gov.ua/transaction 
    [4] => https://api.jquery.com/data/ 
    [5] => https://api.jquery.com/jquery.data/ 
    [6] => searchdatamanagement.techtarget.com/definition/data 
    [7] => www.businessdictionary.com/definition/data.html  
    [8] => https://data.world/ 
    [9] => https://en.oxforddictionaries.com/definition/data 
)

Result array, with empty entries:

Array ( 
    [0] => en.wikipedia.org 
    [1] => 
    [2] => 
    [3] => 
    [4] => api.jquery.com 
    [5] => api.jquery.com 
    [6] => 
    [7] => 
    [8] => data.world 
    [9] => en.oxforddictionaries.com 
)
Djos asked Dec 23 '16 20:12



2 Answers

Some of the $urls being parsed have no scheme, which causes parse_url to treat their hosts as part of the path.

For example, parsing the URL data.gov.ua/ returns data.gov.ua/ as the path. Adding a scheme such as https, so that the URL becomes https://data.gov.ua/, allows parse_url to recognise data.gov.ua as the host.
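A minimal sketch of that fix, assuming every entry without a scheme is a bare "host/path" string. The function name clean_url_with_scheme and the choice of https as the default scheme are illustrative and not from the original post:

```php
<?php
// Hypothetical helper: returns the host of each URL in $urls.
// Assumption: an entry with no "scheme://" prefix is a bare "host/path" string.
function clean_url_with_scheme(array $urls): array
{
    $hosts = [];
    foreach ($urls as $url) {
        $url = trim($url);
        // Prepend a scheme only when the string does not already start with one.
        if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
            $url = 'https://' . $url;
        }
        // With PHP_URL_HOST, parse_url() returns only the host
        // (null if the URL has no host, false if it is seriously malformed).
        $hosts[] = parse_url($url, PHP_URL_HOST);
    }
    return $hosts;
}
```

Called with the input array from the question, this should return a host for every entry instead of leaving gaps for the scheme-less ones.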

Ben Plummer answered Oct 10 '22 15:10



The general format of a URL is:

scheme://hostname:port/path?query#fragment

Each part of the URL is optional, and the parser uses the delimiters between the parts to determine which ones have been provided or omitted.
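As a quick illustration of how the delimiters map to components, the sketch below parses a URL that contains every part of the format above. The host example.com and the port, path, query and fragment values are made up for the example:

```php
<?php
// Made-up URL containing every component of the general format.
$parts = parse_url('https://example.com:8080/path?query=1#frag');

// Each component is stored under its own key; the port is returned as an integer.
foreach ($parts as $name => $value) {
    echo $name, ' => ', $value, PHP_EOL;
}
```

Removing a delimiter removes the corresponding key: without the leading // there is no host entry at all.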

The hostname is the part of the URL after the // prefix. Many of your URLs are missing this prefix, so they don't have a hostname.

For instance, parse_url('data.gov.ua/') returns:

Array
(
    [path] => data.gov.ua/
)

To get what you want, the call should be parse_url('//data.gov.ua/'), which returns:

Array
(
    [host] => data.gov.ua
    [path] => /
)
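One way to apply this to mixed input is to add the // prefix only when it is missing. The sketch below assumes that any string lacking a leading "scheme://" or "//" is meant to start with a hostname. The helper name host_of is hypothetical:

```php
<?php
// Hypothetical helper: extracts the host from a URL that may lack "scheme://".
// Assumption: a string without a leading "scheme://" or "//" starts with the host.
function host_of(string $url): ?string
{
    $url = trim($url);
    // Add the "//" marker when absent so the first segment is parsed
    // as the host rather than as the path.
    if (!preg_match('#^([a-z][a-z0-9+.-]*:)?//#i', $url)) {
        $url = '//' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    // Normalise "no host" (null) and "malformed" (false) to null.
    return is_string($host) ? $host : null;
}

// Usage: map the helper over an array of URLs.
$hosts = array_map('host_of', ['data.gov.ua/', 'https://data.world/']);
```

This keeps any existing scheme untouched, which matters if the scheme is needed later; prepending a fixed scheme instead would work equally well for extracting the host alone.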

This frequently confuses programmers because browsers are very forgiving about incomplete URLs typed into the location field: they use heuristics to decide whether something is a hostname or a path. APIs like parse_url() are stricter.

Barmar answered Oct 10 '22 16:10
