I have to call a large number of APIs concurrently. I'm trying to do this via multi-threaded cURL, but it fails to get all the API results properly (some requests error out; I think they're timing out?) if I pass it a lot of URLs. 50 URLs at a time seems to be the max I can pass it, and around 100 at a time is when I really start seeing problems. Because of this, I've had to implement logic to chunk the URLs I curl at a given time.
Questions:
Here's the script:
function multithreaded_curl(array $urls, $concurrent_urls = 50)
{
    // Data to be returned
    $total_results = array();
    // Chunk the URLs
    $chunked_urls = array_chunk($urls, $concurrent_urls);
    foreach ($chunked_urls as $chunked_url) {
        // Chunked results
        $results = array();
        // Array of cURL handles
        $curl_handles = array();
        // Multi-handle
        $mh = curl_multi_init();
        // Loop through the chunk and create curl handles, then add them to the multi-handle
        foreach ($chunked_url as $k => $v) {
            $curl_handles[$k] = curl_init();
            curl_setopt($curl_handles[$k], CURLOPT_URL, $v);
            curl_setopt($curl_handles[$k], CURLOPT_HEADER, 0);
            curl_setopt($curl_handles[$k], CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($curl_handles[$k], CURLOPT_SSL_VERIFYPEER, 0);
            curl_multi_add_handle($mh, $curl_handles[$k]);
        }
        // Execute the handles
        $running = NULL;
        do {
            curl_multi_exec($mh, $running);
        } while ($running > 0);
        // Get content and remove handles
        foreach ($curl_handles as $k => $v) {
            $results[$k] = json_decode(curl_multi_getcontent($v), TRUE);
            curl_multi_remove_handle($mh, $v);
        }
        // All done
        curl_multi_close($mh);
        // Combine results
        $total_results = array_merge($total_results, $results);
    }
    return $total_results;
}
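For context, I call it roughly like this (the URLs below are made up):

$api_urls = array(
    'https://api.example.com/endpoint?id=1',
    'https://api.example.com/endpoint?id=2',
    // ... several hundred more
);
$all_results = multithreaded_curl($api_urls);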
concerning Q1: As already commented, there are several ways this algorithm can run into trouble. First of all, it probably exhausts local resources (cURL handles, sockets, etc.) as well as remote ones (maxConnections, maxThreads, etc.). Do not do it that way.
concerning Q2: you don't need to (see below), but please look at the actual error responses before guessing at the cause.
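For example, here is a minimal sketch of how to surface per-transfer errors; it assumes the same $mh and $curl_handles variables as your script and would run after the curl_multi_exec loop, before you collect the content (curl_strerror() needs PHP 5.5+):

while (($info = curl_multi_info_read($mh)) !== false) {
    if ($info['msg'] === CURLMSG_DONE && $info['result'] !== CURLE_OK) {
        // Map the finished handle back to its key and log the real cURL error
        $key = array_search($info['handle'], $curl_handles, true);
        error_log(sprintf('Request %s failed: %s (cURL error %d)',
            $key, curl_strerror($info['result']), $info['result']));
    }
}

That tells you whether the failures really are timeouts (CURLE_OPERATION_TIMEDOUT) or something else, such as connection refusals from the remote server.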
concerning Q3: yes, there are several limits at the REMOTE webserver, depending on its vendor (limits on thread counts, maximum connection count, maximum connections per client, etc.). If it is also your server, you can tune these to better suit your needs, but first you should tune the client algorithm.
Overall, it does not make much sense to start more than a handful of connections at a time. Connection reuse is much faster, doesn't exhaust your local handles, and doesn't amount to a DoS attack on the remote system. The only reason to open more would be that the server spends much longer processing a request than the I/O takes.
Did you measure the speed when you open, say, 4 connections at a time and reuse them instead of creating new ones? As it stands, you are populating curl_handles[] with a fresh handle for a single use each. Creating I/O objects costs time. See the sketch below.
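Here is a rough sketch of what I mean: a rolling window that keeps at most $window transfers in flight and reuses the same easy handles for the next URLs. The function name rolling_curl, the window size of 4 and the 30-second CURLOPT_TIMEOUT are only illustrative choices, not something from your code:

function rolling_curl(array $urls, $window = 4)
{
    $urls      = array_values($urls);  // normalise keys to 0, 1, 2, ...
    $total     = count($urls);
    $results   = array();
    $mh        = curl_multi_init();
    $next      = 0;                    // index of the next URL to hand out
    $in_flight = 0;                    // transfers currently attached to $mh

    // Prime the window: create only $window easy handles and keep reusing them
    while ($next < $total && $in_flight < $window) {
        $ch = curl_init($urls[$next]);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_setopt($ch, CURLOPT_PRIVATE, (string) $next); // remember the slot
        curl_multi_add_handle($mh, $ch);
        $next++;
        $in_flight++;
    }

    while ($in_flight > 0) {
        curl_multi_exec($mh, $running);
        if ($running && curl_multi_select($mh, 1.0) === -1) {
            usleep(100000); // avoid a busy loop if select() reports an error
        }
        // Harvest finished transfers and immediately hand each handle the next URL
        while (($info = curl_multi_info_read($mh)) !== false) {
            $ch  = $info['handle'];
            $idx = (int) curl_getinfo($ch, CURLINFO_PRIVATE);
            $results[$idx] = ($info['result'] === CURLE_OK)
                ? json_decode(curl_multi_getcontent($ch), TRUE)
                : NULL;
            curl_multi_remove_handle($mh, $ch);
            if ($next < $total) {
                // Reuse the same easy handle for the next URL instead of creating a new one
                curl_setopt($ch, CURLOPT_URL, $urls[$next]);
                curl_setopt($ch, CURLOPT_PRIVATE, (string) $next);
                $next++;
                curl_multi_add_handle($mh, $ch);
            } else {
                curl_close($ch);
                $in_flight--;
            }
        }
    }

    curl_multi_close($mh);
    ksort($results);  // restore original URL order
    return $results;
}

With this, the number of handles and open connections no longer grows with count($urls), and the multi handle's connection cache can keep connections to the same host alive between requests.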