 

PHP Parallel curl requests

I am doing a simple app that reads JSON data from 15 different URLs. I have a special need: I need to do this server-side. I am using file_get_contents($url).

Since I am using file_get_contents($url), I wrote a simple script; here it is:

$websites = array(
    $url1,
    $url2,
    $url3,
    ...
    $url15
);

foreach ($websites as $website) {
    $data[] = file_get_contents($website);
}

and it proved to be very slow, because it waits for the first request to finish before starting the next one.

asked Feb 16 '12 by user1205408


People also ask

How do you run curls in parallel?

The solution to this is to use the xargs command alongside the curl command. The -P flag denotes the number of requests to run in parallel. The section <(printf '%s\n' {1..10}) prints out the numbers 1 to 10 and causes the curl command to run 10 times, with 5 requests running in parallel.

Is PHP cURL asynchronous?

Short answer: no, it isn't asynchronous. Longer answer: "not unless you wrote the backend yourself to do so." If you're using XHR, each request is going to have a different worker thread on the backend, which means no request should block any other, barring hitting process and memory limits.

How do you use cURL multi?

To use the multi interface, you must first create a 'multi handle' with curl_multi_init. This handle is then used as input to all further curl_multi_* functions. With a multi handle and the multi interface you can do several simultaneous transfers in parallel. Each single transfer is built up around an easy handle.
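For illustration, here is a minimal sketch of that flow, assuming two placeholder URLs (example.com and example.org, not taken from the question): create the multi handle, build one easy handle per transfer, add them all, drive the transfers, then collect the bodies.

<?php
// minimal curl_multi sketch; the URLs are placeholder examples
$urls = ["http://example.com", "http://example.org"];

$mh = curl_multi_init();                  // the 'multi handle'
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);                // one easy handle per transfer
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);      // hand it over to the multi handle
    $handles[$url] = $ch;
}

// run all transfers in parallel until every handle is done
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh);           // wait for activity instead of spinning
    }
} while ($running > 0);

$bodies = [];
foreach ($handles as $url => $ch) {
    $bodies[$url] = curl_multi_getcontent($ch);   // response body of each transfer
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

The curl_multi_select() call is what keeps the loop from spinning at 100% CPU while transfers are still in flight, which is also the point the second answer below makes.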

What is Curl_multi_exec?

curl_multi_exec(CurlMultiHandle $multi_handle, int &$still_running): int. Processes each of the handles in the stack. This method can be called whether or not a handle needs to read or write data.


2 Answers

If you mean multi-curl, then something like this might help:

$nodes = array($url1, $url2, $url3);
$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

for ($i = 0; $i < $node_count; $i++) {
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

do {
    curl_multi_exec($master, $running);
} while ($running > 0);

for ($i = 0; $i < $node_count; $i++) {
    $results[] = curl_multi_getcontent($curl_arr[$i]);
}
print_r($results);

Hope it helps in some way

answered Sep 26 '22 by Sudhir Bastakoti


I don't particularly like the approach of any of the existing answers.

Timo's code: it might sleep/select() during CURLM_CALL_MULTI_PERFORM, which is wrong, and it might also fail to sleep when ($still_running > 0 && $exec != CURLM_CALL_MULTI_PERFORM), which may make the code spin at 100% CPU usage (of 1 core) for no reason.

Sudhir's code: it will not sleep when $still_running > 0, and instead spam-calls the asynchronous function curl_multi_exec() until everything has been downloaded, which causes PHP to use 100% CPU (of 1 CPU core) until everything has been downloaded; in other words, it fails to sleep while downloading.

Here's an approach with neither of those issues:

$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
$mh = curl_multi_init();
foreach ($websites as $website) {
    $worker = curl_init($website);
    curl_setopt_array($worker, [
        CURLOPT_RETURNTRANSFER => 1
    ]);
    curl_multi_add_handle($mh, $worker);
}
for (;;) {
    $still_running = null;
    do {
        $err = curl_multi_exec($mh, $still_running);
    } while ($err === CURLM_CALL_MULTI_PERFORM);
    if ($err !== CURLM_OK) {
        // handle curl multi error?
    }
    if ($still_running < 1) {
        // all downloads completed
        break;
    }
    // some haven't finished downloading, sleep until more data arrives:
    curl_multi_select($mh, 1);
}
$results = [];
while (false !== ($info = curl_multi_info_read($mh))) {
    if ($info["result"] !== CURLE_OK) {
        // handle download error?
    }
    $results[curl_getinfo($info["handle"], CURLINFO_EFFECTIVE_URL)] = curl_multi_getcontent($info["handle"]);
    curl_multi_remove_handle($mh, $info["handle"]);
    curl_close($info["handle"]);
}
curl_multi_close($mh);
var_export($results);

Note that an issue shared by all 3 approaches here (my answer, Sudhir's answer, and Timo's answer) is that they open all connections simultaneously: if you have 1,000,000 websites to fetch, these scripts will try to open 1,000,000 connections at once. If you only need to download, say, 50 websites at a time, maybe try:

$websites = array(
    "http://google.com",
    "http://example.org"
    // $url2,
    // $url3,
    // ...
    // $url15
);
var_dump(fetch_urls($websites, 50));

function fetch_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $return_fault_reason = true): array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (! is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); // ?
        }
    }
    unset($foo);
    // DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    $work = function () use (&$ret, &$workers, &$mh, $return_fault_reason) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            do {
                $err = curl_multi_exec($mh, $still_running);
            } while ($err === CURLM_CALL_MULTI_PERFORM);
            if ($still_running < count($workers)) {
                // some workers finished, fetch their response and close them
                break;
            }
            $cms = curl_multi_select($mh, 1);
            // var_dump('sr: ' . $still_running . " c: " . count($workers) . " cms: " . $cms);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            // echo "NOT FALSE!";
            // var_dump($info);
            {
                if ($info['msg'] !== CURLMSG_DONE) {
                    continue;
                }
                if ($info['result'] !== CURLE_OK) {
                    if ($return_fault_reason) {
                        $ret[$workers[(int) $info['handle']]] = print_r(array(
                            false,
                            $info['result'],
                            "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result'])
                        ), true);
                    }
                } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                    if ($return_fault_reason) {
                        $ret[$workers[(int) $info['handle']]] = print_r(array(
                            false,
                            $err,
                            "curl error " . $err . ": " . curl_strerror($err)
                        ), true);
                    }
                } else {
                    $ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
                }
                curl_multi_remove_handle($mh, $info['handle']);
                assert(isset($workers[(int) $info['handle']]));
                unset($workers[(int) $info['handle']]);
                curl_close($info['handle']);
            }
        }
        // echo "NO MORE INFO!";
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            // echo "TOO MANY WORKERS!\n";
            $work();
        }
        $neww = curl_init($url);
        if (! $neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of system resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(
                    false,
                    -1,
                    "curl_init() failed"
                );
            }
            continue;
        }
        $workers[(int) $neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        // curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        // echo "WAITING FOR WORKERS TO BECOME 0!";
        // var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}

That will download the entire list while never downloading more than 50 URLs simultaneously. (But even that approach stores all the results in RAM, so it may still end up running out of memory; if you want to store the results in a database instead of in RAM, the curl_multi_getcontent part can be modified to write to a database rather than to a RAM-persistent variable, as sketched below.)
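As a rough illustration of that modification (the PDO connection, the SQLite file name and the responses table are made-up assumptions, not part of the original answer), a small helper could persist each finished download as soon as it is read:

<?php
// a minimal sketch, assuming SQLite via PDO; the file and table names are hypothetical
$pdo = new PDO('sqlite:responses.db');
$pdo->exec('CREATE TABLE IF NOT EXISTS responses (url TEXT PRIMARY KEY, body TEXT)');

// persists one finished download instead of keeping it in a PHP array
function store_response(PDO $pdo, string $url, string $body): void
{
    $stmt = $pdo->prepare('REPLACE INTO responses (url, body) VALUES (?, ?)');
    $stmt->execute([$url, $body]);
}

// inside the $work closure above, the line
//     $ret[$workers[(int) $info['handle']]] = curl_multi_getcontent($info['handle']);
// could then become
//     store_response($pdo, $workers[(int) $info['handle']], curl_multi_getcontent($info['handle']));
// so the response body never stays in a RAM-persistent variable

With that in place, $ret would only need to hold error entries (or nothing at all), so memory use stays roughly flat no matter how many URLs are fetched.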

answered Sep 26 '22 by hanshenrik