 

Downloading pages in parallel using PHP

I have to scrape a web site where I need to fetch multiple URLs and then process them one by one. The current process goes somewhat like this.

I fetch a base URL and get all secondary URLs from that page. Then for each secondary URL I fetch it, process the found page, download some photos (which takes quite a long time) and store this data in the database, then fetch the next URL and repeat the process.

In this process, I think I am wasting some time fetching the secondary URL at the start of each iteration, so I am trying to fetch the next URLs in parallel while processing the first iteration.

The solution I have in mind is to call a PHP script from the main process, say a downloader, which will download all the URLs (with curl_multi or wget) and store them in some database.

My questions are:

  • How to call such a downloader asynchronously? I don't want my main script to wait till the downloader completes (one possible way is sketched after this list).
  • Where to store the downloaded data, such as in shared memory? Of course, somewhere other than the database.
  • Is there any chance that the data gets corrupted while storing and retrieving, and how can I avoid this?
  • Also, please let me know if anyone has a better plan.
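
For the first point, here is a minimal sketch of one way to start such a downloader in the background without blocking the main script, assuming a Linux host and a hypothetical downloader.php that reads a list of URLs from a file (the script name, file handling and arguments are placeholders, not a fixed design):

// in the main scraping script
$urls = array('http://example.com/a', 'http://example.com/b');
$listFile = tempnam(sys_get_temp_dir(), 'urls_');
file_put_contents($listFile, implode("\n", $urls));

// redirecting output and appending '&' lets exec() return immediately,
// so the main script keeps running while downloader.php works
exec('php downloader.php ' . escapeshellarg($listFile) . ' > /dev/null 2>&1 &');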
asked Feb 10 '12 by Uday Sawant

1 Answer

When I hear that someone uses curl_multi_exec, it usually turns out they just load it with, say, 100 URLs, then wait until all complete, then process them all, and then start over with the next 100 URLs... Blame me, I was doing so too, but then I found out that it is possible to remove/add handles to curl_multi while something is still in progress, and it really saves a lot of time, especially if you reuse already open connections. I wrote a small library to handle a queue of requests with callbacks; I'm not posting the full version here, of course ("small" is still quite a bit of code), but here's a simplified version of the main thing to give you the general idea:

public function launch() {
    $channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
    $activeJobs = array();
    $running = 0;
    $mrc = CURLM_OK; // so the loop condition is defined even if the queue starts empty
    do {
        // pick jobs for free channels:
        while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
            // take free channel, (re)init curl handle and let
            // queued object set options
            $chId = key($freeChannels);
            if (empty($channels[$chId])) {
                $channels[$chId] = curl_init();
            }
            $job = array_pop($this->jobQueue);
            $job->init($channels[$chId]);
            curl_multi_add_handle($this->master, $channels[$chId]);
            $activeJobs[$chId] = $job;
            unset($freeChannels[$chId]);
        }
        $pending = count($activeJobs);

        // launch them:
        if ($pending > 0) {
            while(($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
                // poke it while it wants
            curl_multi_select($this->master);
                // wait for some activity, don't eat CPU
            while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
                // some connection(s) finished, locate that job and run response handler:
                $pending--;
                $chId = array_search($info['handle'], $channels);
                $content = curl_multi_getcontent($channels[$chId]);
                curl_multi_remove_handle($this->master, $channels[$chId]);
                $freeChannels[$chId] = NULL;
                    // free up this channel
                if ( !array_key_exists($chId, $activeJobs) ) {
                    // impossible, but...
                    continue;
                }
                $activeJobs[$chId]->onComplete($content);
                unset($activeJobs[$chId]);
            }
        }
    } while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
}

In my version the $jobs are actually instances of a separate class, not instances of controllers or models. They just handle setting cURL options, parsing the response and calling a given onComplete callback. With this structure new requests will start as soon as something out of the pool finishes.
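
Since I'm not posting the real class, here's a rough sketch of what a compatible job could look like (the class name and callback wiring are illustrative only, not the exact code from the library):

class DownloadJob {
    private $url;
    private $callback;

    public function __construct($url, $callback) {
        $this->url = $url;
        $this->callback = $callback;
    }

    // called by launch() with a (re)used cURL handle; sets per-request options
    public function init($ch) {
        curl_setopt($ch, CURLOPT_URL, $this->url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    }

    // called by launch() with the body returned by curl_multi_getcontent()
    public function onComplete($content) {
        call_user_func($this->callback, $this->url, $content);
    }
}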

Of course it doesn't really save you if not just the retrieving but the processing takes time as well... And it isn't true parallel handling. But I still hope it helps. :)

P.S. It did the trick for me. :) A once 8-hour job now completes in 3-4 minutes using a pool of 50 connections. Can't describe that feeling. :) I didn't really expect it to work as planned, because with PHP things rarely work exactly as supposed to... That was like "ok, hope it finishes in at least an hour... Wha... Wait... Already?! 8-O"

answered Sep 21 '22 by Slava