
Using cURL to get all links in a website (not only the page)

I use the following PHP script to get all the links on a given page, but I'm trying to get all the links on a website as a whole.

<?php

    // Fetch a single page with cURL and print every href found on it.
    function urlLooper($url){

        $urlArray = array();

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $result = curl_exec($ch);

        // Crude regex to pull the href value out of each anchor tag.
        $regex='|<a.*?href="(.*?)"|';
        preg_match_all($regex,$result,$parts);
        $links=$parts[1];
        foreach($links as $link){
            array_push($urlArray, $link);
        }
        curl_close($ch);

        foreach($urlArray as $value){
            echo $value . '<br />';
        }
    }

    $url = 'http://www.justfundraising.com/';
    urlLooper($url);

?>

Is there any way to use cURL (or any other method frankly) to get all the links on a website? I have access to the server in case you're wondering.

My idea was to generate all the links from, say, the homepage and then pass those links back through the same function to get a new list of links ignoring any duplicates. I figure that way I'll get all the pages.
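
Roughly what I have in mind, as an untested sketch (the same-host check and relative-link handling are deliberately naive, and the function and variable names are just placeholders):

<?php

    // Untested sketch: keep a queue of pages to visit and a set of URLs
    // already seen, and only follow links that stay on the same host.
    function crawlSite($startUrl, $maxPages = 100){

        $host    = parse_url($startUrl, PHP_URL_HOST);
        $queue   = array($startUrl);
        $visited = array();

        while(!empty($queue) && count($visited) < $maxPages){
            $url = array_shift($queue);
            if(isset($visited[$url])){
                continue;           // already crawled this page
            }
            $visited[$url] = true;

            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            $html = curl_exec($ch);
            curl_close($ch);

            if($html === false){
                continue;           // fetch failed, skip this page
            }

            // Same crude regex as above to pull out the href values.
            preg_match_all('|<a.*?href="(.*?)"|', $html, $parts);
            foreach($parts[1] as $link){
                // Very naive relative-link handling: prepend the site root.
                if(strpos($link, 'http') !== 0){
                    $link = rtrim($startUrl, '/') . '/' . ltrim($link, '/');
                }
                // Stay on the same site and skip anything already seen.
                if(parse_url($link, PHP_URL_HOST) !== $host || isset($visited[$link])){
                    continue;
                }
                $queue[] = $link;
            }
        }

        return array_keys($visited);
    }

    foreach(crawlSite('http://www.justfundraising.com/') as $page){
        echo $page . '<br />';
    }

?>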

Any help will be appreciated!

asked Aug 11 '11 by Tom Hartman




2 Answers

As @mario mentions above, perhaps look into using phpQuery (http://code.google.com/p/phpquery/). Once you have downloaded the library and included it on your page, here is some example code showing how you can get an array containing all the links from the string you pass to it (I have just hardcoded a string in the newDocument function as an example):

$links = phpQuery::newDocument('<a href="test1.html">Test 1</a><a href="test2.html">Test 2</a><a href="test3.html">Test 3</a>')->find('a');
$array_links = array();
foreach($links as $r) {
    $array_links[] = pq($r)->attr('href');
}
die("<pre>".print_r($array_links,true)."</pre>");

The above code will return:

Array
(
    [0] => test1.html
    [1] => test2.html
    [2] => test3.html
)
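
To run it against a real page rather than a hardcoded string, you could feed the HTML you already fetch with curl straight into newDocument, something like this (untested):

$ch = curl_init('http://www.justfundraising.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$array_links = array();
// Hand the fetched HTML to phpQuery instead of a hardcoded string.
foreach(phpQuery::newDocument($html)->find('a') as $r){
    $array_links[] = pq($r)->attr('href');
}
die("<pre>".print_r($array_links,true)."</pre>");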

Hope this helps.

answered Oct 26 '22 by Rook


curl only fetches what you tell it to. It won't parse the contents for you, and it won't recursively fetch "external" resources referred to by the content. You'll have to rummage around in the returned HTML yourself, parse out image/script links, and use more curl calls to fetch those.

In other words, you'll have to replicate wget, which boils down to: just use wget.
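
For example, something along these lines should pull the whole site down to disk, where you can then grep the saved pages for links (untested, so check the wget manual for the exact options you want):

    wget --recursive --no-parent --convert-links --wait=1 http://www.justfundraising.com/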

answered Oct 26 '22 by Marc B