
Using cURL to get all links in a website (not only the page)

I use the following PHP script to get all the links on a given page, but I'm trying to get all the links on a website as a whole.

<?php

    // Fetch a single page with cURL and print every href found on it.
    function urlLooper($url){

        $urlArray = array();

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $result = curl_exec($ch);

        // Crude regex to pull the href value out of each anchor tag.
        $regex='|<a.*?href="(.*?)"|';
        preg_match_all($regex,$result,$parts);
        $links=$parts[1];
        foreach($links as $link){
            array_push($urlArray, $link);
        }
        curl_close($ch);

        foreach($urlArray as $value){
            echo $value . '<br />';
        }
    }

    $url = 'http://www.justfundraising.com/';
    urlLooper($url);

?>

Is there any way to use cURL (or any other method frankly) to get all the links on a website? I have access to the server in case you're wondering.

My idea was to generate all the links from, say, the homepage and then pass those links back through the same function to get a new list of links ignoring any duplicates. I figure that way I'll get all the pages.
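
Roughly what I have in mind, as an untested sketch (the same-host check and relative-link handling are deliberately naive, and the function and variable names are just placeholders):

<?php

    // Untested sketch: keep a queue of pages to visit and a set of URLs
    // already seen, and only follow links that stay on the same host.
    function crawlSite($startUrl, $maxPages = 100){

        $host    = parse_url($startUrl, PHP_URL_HOST);
        $queue   = array($startUrl);
        $visited = array();

        while(!empty($queue) && count($visited) < $maxPages){
            $url = array_shift($queue);
            if(isset($visited[$url])){
                continue;           // already crawled this page
            }
            $visited[$url] = true;

            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            $html = curl_exec($ch);
            curl_close($ch);

            if($html === false){
                continue;           // fetch failed, skip this page
            }

            // Same crude regex as above to pull out the href values.
            preg_match_all('|<a.*?href="(.*?)"|', $html, $parts);
            foreach($parts[1] as $link){
                // Very naive relative-link handling: prepend the site root.
                if(strpos($link, 'http') !== 0){
                    $link = rtrim($startUrl, '/') . '/' . ltrim($link, '/');
                }
                // Stay on the same site and skip anything already seen.
                if(parse_url($link, PHP_URL_HOST) !== $host || isset($visited[$link])){
                    continue;
                }
                $queue[] = $link;
            }
        }

        return array_keys($visited);
    }

    foreach(crawlSite('http://www.justfundraising.com/') as $page){
        echo $page . '<br />';
    }

?>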

Any help will be appreciated!

asked Aug 11 '11 by Tom Hartman




2 Answers

As @mario mentions above, perhaps look into using phpQuery (http://code.google.com/p/phpquery/). Once you have downloaded the library and included it on your page, here is some example code showing how you can get an array containing all the links from the string you pass to it (I have just hardcoded a string in the newDocument function as an example):

$links = phpQuery::newDocument('<a href="test1.html">Test 1</a><a href="test2.html">Test 2</a><a href="test3.html">Test 3</a>')->find('a');
$array_links = array();
foreach($links as $r) {
    $array_links[] = pq($r)->attr('href');
}
die("<pre>".print_r($array_links,true)."</pre>");

The above code will return:

Array
(
    [0] => test1.html
    [1] => test2.html
    [2] => test3.html
)
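
To run it against a real page rather than a hardcoded string, you could feed the HTML you already fetch with curl straight into newDocument, something like this (untested):

$ch = curl_init('http://www.justfundraising.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$array_links = array();
// Hand the fetched HTML to phpQuery instead of a hardcoded string.
foreach(phpQuery::newDocument($html)->find('a') as $r){
    $array_links[] = pq($r)->attr('href');
}
die("<pre>".print_r($array_links,true)."</pre>");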

Hope this helps.

answered Oct 26 '22 by Rook


curl only fetches what you tell it to. It won't parse the contents for you, and it won't recursively fetch "external" resources referred to by the content. You'll have to rummage around in the returned HTML yourself, parse out image/script links, and use more curl calls to fetch those.

In other words, you'll have to replicate wget, which boils down to: just use wget.
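
For example, something along these lines should pull the whole site down to disk, where you can then grep the saved pages for links (untested, so check the wget manual for the exact options you want):

    wget --recursive --no-parent --convert-links --wait=1 http://www.justfundraising.com/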

answered Oct 26 '22 by Marc B