
multi-thread, multi-curl crawler in PHP

Hi everyone once again!

We need some help to develop and implement multi-curl functionality in our crawler. We have a huge array of "links to be scanned" and we loop through them with a foreach.

Let's use some pseudo code to understand the logic (a sequential sketch follows the list):

    1) While ($links_to_be_scanned > 0).
    2) Foreach ($links_to_be_scanned as $link_to_be_scanned).
    3) Scan_the_link() and run some other functions.
    4) Extract the new links from the xdom.
    5) Push the new links into $links_to_be_scanned.
    6) Push the current link into $links_already_scanned.
    7) Remove the current link from $links_to_be_scanned.
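
As a rough baseline, a sequential version of that pseudo code might look like the sketch below. Note that extract_links() is a hypothetical stand-in for whatever DOM/XPath extraction logic you already have:

<?php
// Minimal sequential sketch of the pseudo code above.
// extract_links() is a placeholder for your own extraction logic.
$links_to_be_scanned   = array('http://example.com/');
$links_already_scanned = array();

while (count($links_to_be_scanned) > 0) {
    $link = array_shift($links_to_be_scanned);   // remove the current link
    $body = file_get_contents($link);            // scan the link
    $new_links = extract_links($body);           // extract the new links

    foreach ($new_links as $new_link) {
        // only queue links we haven't seen before
        if (!in_array($new_link, $links_already_scanned)
            && !in_array($new_link, $links_to_be_scanned)) {
            $links_to_be_scanned[] = $new_link;
        }
    }

    $links_already_scanned[] = $link;            // mark the current link as scanned
}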

Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.

I understand that we're going to have to create a $links_being_scanned array or some kind of queue.
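
For the queue itself, PHP's built-in SplQueue is one option, although a plain array driven by array_shift()/array_push() works just as well. A tiny sketch of the three link states:

<?php
// One possible way to model the three link states.
$links_to_be_scanned   = new SplQueue();
$links_being_scanned   = array();
$links_already_scanned = array();

$links_to_be_scanned->enqueue('http://example.com/');

$link = $links_to_be_scanned->dequeue();  // take the next link
$links_being_scanned[$link] = true;       // mark it as in flight
// ... scan the link here ...
unset($links_being_scanned[$link]);       // no longer in flight
$links_already_scanned[$link] = true;     // mark it as done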

I'm really not sure how to approach this problem, to be honest. If anyone could provide a snippet or an idea to solve it, it would be greatly appreciated.

Thanks in advance! Chris

Extended:

I just realized that the multi-curl itself is not the tricky part, but rather the amount of operations performed on each link after the request.

Even with multi-curl, I would eventually have to find a way to run all these operations in parallel. The whole algorithm described below would have to run in parallel.

So now rethinking, we would have to do something like this:

  While (there are links to be scanned)
    Foreach ($links_to_be_scanned as $link)
      If (there are fewer than 10 scanners running)
        Launch_a_new_scanner($link)
        Remove the link from the $links_to_be_scanned array
        Push the link into the $links_on_queue array
      Endif
    Endforeach
  Endwhile

And each scanner does the following (this should run in parallel; see the sketch after the list):

  Create an object with the given link
  Send a curl request to the given link
  Create a dom and an Xdom with the response body
  Perform other operations over the response body
  Remove the link from the $links_on_queue array
  Push the link into the $links_already_scanned array
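
A single scanner pass might look something like the sketch below, using curl plus DOMDocument/DOMXPath for the "dom and Xdom" part. Everything beyond the standard curl and DOM calls here is an assumption about your setup:

<?php
// Hypothetical scanner body: fetch one link, build a DOM and an XPath
// object over the response, and return the newly discovered links.
function scan_link($link) {
    $ch = curl_init($link);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    curl_close($ch);

    libxml_use_internal_errors(true); // real-world HTML is rarely well formed
    $dom = new DOMDocument();
    $dom->loadHTML($body);
    $xpath = new DOMXPath($dom);

    $new_links = array();
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $new_links[] = $anchor->getAttribute('href');
    }

    // ... perform your other operations over $body / $dom here ...

    return $new_links;
}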

I assume we could approach this by creating a new PHP file with the scanner algorithm and using pcntl_fork() for each parallel process?

Since even using multi-curl, I would eventually have to wait in a regular foreach loop for the other processes to finish.

I assume I would have to approach this using fsockopen or pcntl_fork.
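
For what it's worth, pcntl_fork() can handle the "at most 10 scanners" part, with the caveat that forked children do not share PHP variables with the parent, so results have to travel back through pipes, files, or a database rather than through shared arrays. A minimal sketch of the pool logic, reusing the hypothetical scan_link() from above:

<?php
// Hypothetical fork-based pool: at most $max_children scanners at once.
// Children receive a copy of the parent's memory, so pushing into
// $links_already_scanned inside a child will NOT be visible to the parent.
$max_children = 10;
$children = 0;

while (count($links_to_be_scanned) > 0) {
    $link = array_shift($links_to_be_scanned);

    $pid = pcntl_fork();
    if ($pid === -1) {
        die('fork failed');
    } elseif ($pid === 0) {
        scan_link($link); // child: do the work, report results via IPC, then exit
        exit(0);
    }

    $children++;
    if ($children >= $max_children) {
        pcntl_wait($status); // parent: block until one child finishes
        $children--;
    }
}

while ($children-- > 0) {
    pcntl_wait($status); // reap the remaining children
}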

Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!

Thanks a lot!

asked Oct 30 '12 by Chris Russo


2 Answers

You could try something like this. I haven't checked it, but you should get the idea:

$request_pool = array();

function CreateHandle($url) {
    $handle = curl_init($url);

    // CURLOPT_RETURNTRANSFER is required so that curl_multi_getcontent()
    // can hand the response body to Process() later on
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);

    // set any other curl options here

    return $handle;
}

function Process($data) {
    global $request_pool;

    // do something with the response body, e.g. extract new URLs from it
    // ($some_new_url is a placeholder for whatever your extraction yields)
    array_push($request_pool , CreateHandle($some_new_url));
}

function RunMulti() {
    global $request_pool;

    $multi_handle = curl_multi_init();

    $active_request_pool = array();

    $running = 0;
    $active_request_count = 0;
    $active_request_max = 10; // adjust as necessary
    do {
        $waiting_request_count = count($request_pool);
        while(($active_request_count < $active_request_max) && ($waiting_request_count > 0)) {
            $request = array_shift($request_pool);
            curl_multi_add_handle($multi_handle , $request);
            $active_request_pool[(int)$request] = $request;

            $waiting_request_count--;
            $active_request_count++;
        }

        // run the transfers; repeat while curl asks to be called again immediately
        do {
            $status = curl_multi_exec($multi_handle , $running);
        } while($status === CURLM_CALL_MULTI_PERFORM);

        // wait briefly for activity on any of the sockets
        curl_multi_select($multi_handle);
        while($info = curl_multi_info_read($multi_handle)) {
            $curl_handle = $info['handle'];
            call_user_func('Process' , curl_multi_getcontent($curl_handle));
            curl_multi_remove_handle($multi_handle , $curl_handle);
            unset($active_request_pool[(int)$curl_handle]); // drop our reference to the finished handle
            curl_close($curl_handle);
            $active_request_count--;
        }

        // note: recount $request_pool here because Process() may have pushed new requests
    } while($active_request_count > 0 || count($request_pool) > 0);

    curl_multi_close($multi_handle);
}
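
Usage would then be roughly as follows; the seed URL is of course just an example:

// seed the pool with the starting URL(s), then let RunMulti() drain it;
// Process() keeps refilling the pool with newly discovered links
array_push($request_pool, CreateHandle('http://example.com/'));
RunMulti();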
answered Oct 05 '22 by ninaj


DISCLAIMER: This answer links an open-source project with which I'm involved. There. You've been warned.

The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.

Limiting the number of concurrent connections is easily accomplished. Consider:

<?php

use Artax\Client, Artax\Response;

require dirname(__DIR__) . '/autoload.php';

$client = new Client;

// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);

$requests = array(
    'so-home'    => 'http://stackoverflow.com',
    'so-php'     => 'http://stackoverflow.com/questions/tagged/php',
    'so-python'  => 'http://stackoverflow.com/questions/tagged/python',
    'so-http'    => 'http://stackoverflow.com/questions/tagged/http',
    'so-html'    => 'http://stackoverflow.com/questions/tagged/html',
    'so-css'     => 'http://stackoverflow.com/questions/tagged/css',
    'so-js'      => 'http://stackoverflow.com/questions/tagged/javascript'
);

$onResponse = function($requestKey, Response $r) {
    echo $requestKey, ' :: ', $r->getStatus();
};

$onError = function($requestKey, Exception $e) {
    echo $requestKey, ' :: ', $e->getMessage();
};

$client->requestMulti($requests, $onResponse, $onError);

IMPORTANT: In the above example, the Client::requestMulti method makes all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client will open new connections for the first two requests and subsequently reuse those same sockets for the other requests, queuing requests until one of the two sockets becomes available.

answered Oct 05 '22 by rdlowrey