I am writing a website crawler in PHP and I already have code that can extract all links from a site. The problem is that sites use a mix of absolute and relative URLs. Examples (http replaced with hxxp as I can't post hyperlinks):
hxxp://site.com/
site.com
site.com/index.php
hxxp://site.com/hello/index.php
/hello/index.php
hxxp://site2.com/index.php
site2.com/index.php
I have no control over the links (whether they are absolute or relative), but I do need to follow them, so I need to convert all of these links into absolute URLs. How do I do this in PHP?
Here's a start:
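// The page your crawler was sent to.
$url = 'http://example.com/page';
// Example of a relative link found on the page above.
$relative = '/hello/index.php';
// Parse the page URL into its components (scheme, host, path, ...).
$base = parse_url($url);
if (false === filter_var($relative, FILTER_VALIDATE_URL))
{
    // The link isn't a valid absolute URL, so assume it's relative and
    // construct an absolute URL from the page's scheme and host.
    // Note: this resolves every relative link against the site root, so it
    // is only correct for root-relative links such as '/hello/index.php'.
    // A scheme-less link like 'site.com/index.php' also ends up here.
    print $base['scheme'].'://'.$base['host'].'/'.ltrim($relative, '/');
}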
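Have a look at the http_build_url function as another way of building an absolute URL. Note that it comes from the pecl_http extension rather than core PHP, so it may not be available on your installation.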
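The snippet above only handles root-relative links. As a rough sketch of a fuller approach, the helper below resolves a link against the page it was found on, loosely in the spirit of RFC 3986 reference resolution. The name resolve_url and its behaviour are my own assumptions, not a built-in PHP function, and it takes a few shortcuts (for example, it collapses empty path segments and ignores user:pass@ credentials in the base URL). It also treats a scheme-less link such as site.com/index.php as a relative path, because there is no reliable way to tell it apart from a file name; if your pages really contain such links you would need your own heuristic for them.

```php
<?php
// Hypothetical helper (not part of PHP): resolve $link against the absolute URL $base.
function resolve_url(string $base, string $link): string
{
    // Empty link: it refers to the page itself.
    if ($link === '') {
        return $base;
    }
    // Already absolute (starts with a scheme such as "http:"): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }

    $b = parse_url($base);
    $scheme = $b['scheme'] ?? 'http'; // assumption: default to http if missing
    // Protocol-relative link ("//host/path"): reuse the scheme of the base URL.
    if (strpos($link, '//') === 0) {
        return $scheme . ':' . $link;
    }

    $origin = $scheme . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Fragment-only link: same page, new fragment.
    if ($link[0] === '#') {
        return explode('#', $base, 2)[0] . $link;
    }
    // Query-only link: same path, new query string.
    if ($link[0] === '?') {
        return $origin . $basePath . $link;
    }

    // Separate the path from any query/fragment so that only the path is normalised.
    $pos = strcspn($link, '?#');
    $path = substr($link, 0, $pos);
    $tail = (string) substr($link, $pos);

    // Not root-relative: resolve against the directory of the base path.
    if ($path[0] !== '/') {
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Remove "." and ".." segments (empty segments are dropped as a simplification).
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    $resolved = '/' . implode('/', $out);
    // Keep a trailing slash when the path pointed at a directory.
    if ($out && preg_match('#/(\.\.?)?$#', $path)) {
        $resolved .= '/';
    }

    return $origin . $resolved . $tail;
}
```

With this sketch, resolve_url('http://example.com/hello/page.php', 'index.php') resolves against the /hello/ directory, while a link that already has a scheme is returned untouched. Resolving against the page's directory instead of the site root is what lets links like index.php or ../other.php work.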