
Multi-threaded Curl Can't Handle Large Number of Concurrent URLs?

I have to call a large number of APIs concurrently. I'm trying to do this via multi-threaded curl, but it seems to fail to get all the API results properly (some error out; I think they're timing out?) if I pass it a lot of URLs. 50 URLs at a time seems to be the max I can pass it, and at around 100 at a time I really start seeing problems. Because of this, I have had to implement logic to chunk the URLs I try to curl at a given time.

Questions:

  1. What could be causing my curl problems?
  2. Is there something in curl I can set to tell it to wait longer for responses - in case my problem has something to do with timeouts?
  3. Is there something in my server / php.ini I can configure to improve the performance of my script?

Here's the script:

function multithreaded_curl(array $urls, $concurrent_urls = 50)
    {
        // Data to be returned
        $total_results = array();

        // Chunk the URLs
        $chunked_urls = array_chunk($urls, $concurrent_urls);
        foreach ($chunked_urls as $chunked_url) {
            // Chunked results
            $results = array();

            // Array of cURL handles
            $curl_handles = array();

            // Multi-handle
            $mh = curl_multi_init();

            // Loop through $chunked_urls and create curl handles, then add them to the multi-handle
            foreach ($chunked_url as $k => $v) {
                $curl_handles[$k] = curl_init();

                curl_setopt($curl_handles[$k], CURLOPT_URL, $v);
                curl_setopt($curl_handles[$k], CURLOPT_HEADER, 0);
                curl_setopt($curl_handles[$k], CURLOPT_RETURNTRANSFER, 1);
                curl_setopt($curl_handles[$k], CURLOPT_SSL_VERIFYPEER, 0);

                curl_multi_add_handle($mh, $curl_handles[$k]);
            }

            // Execute the handles
            $running = NULL;
            do {
                curl_multi_exec($mh, $running);
            } while ($running > 0);

            // Get content and remove handles
            foreach ($curl_handles as $k => $v) {
                $results[$k] = json_decode(curl_multi_getcontent($v), TRUE);
                curl_multi_remove_handle($mh, $v);
            }

            // All done
            curl_multi_close($mh);

            // Combine results
            $total_results = array_merge($total_results, $results);
        }

        return $total_results;
    }
asked Oct 19 '22 by StackOverflowNewbie


1 Answer

Concerning Q1: as already commented, there are several ways that algorithm can run into problems. First of all, it probably exhausts local resources (handles etc.) as well as remote ones (maxConnections, maxThreads etc.). Do not do it that way.

Concerning Q2: you don't need to (see below), but please look at the actual error responses before guessing at the cause.
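For illustration (this snippet is not from the original answer, just a sketch of how the errors could be surfaced): after the do/while loop that drives curl_multi_exec(), curl_multi_info_read() reports the result code of each finished transfer, and CURLOPT_CONNECTTIMEOUT / CURLOPT_TIMEOUT can be raised where the handles are created if those codes really do indicate timeouts. The timeout values below are arbitrary examples, not recommendations:

    // Inside the question's function, right after the do/while that runs curl_multi_exec():
    while ($info = curl_multi_info_read($mh)) {
        if ($info['msg'] === CURLMSG_DONE && $info['result'] !== CURLE_OK) {
            // e.g. CURLE_OPERATION_TIMEDOUT (28) or CURLE_COULDNT_CONNECT (7)
            error_log('cURL error ' . $info['result'] . ': ' . curl_error($info['handle']));
        }
    }

    // If the errors really are timeouts, raise the per-handle limits where the handles
    // are created (example values only):
    curl_setopt($curl_handles[$k], CURLOPT_CONNECTTIMEOUT, 10); // seconds to establish the connection
    curl_setopt($curl_handles[$k], CURLOPT_TIMEOUT, 60);        // seconds for the whole transfer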

Concerning Q3: yes, there are several options on the REMOTE web server, depending on its vendor (limits on thread count, maximum connection count, maximum connections per client etc.). If that server is also yours, you can tune these to better suit your needs, but first you should tune the client algorithm.

Overall, it does not make much sense to start more than a handful of connections at a time. Connection reuse is much faster, doesn't exhaust your local handles etc., and doesn't amount to a DoS attack on the remote system. The only reason to open that many would be that the server needs much longer for its request processing than the I/O does.

Did you check the speed when you open, say, only 4 connections at a time and reuse them instead of creating new ones? As it stands, you are populating curl_handles[] for a single use of each handle. Creating I/O objects costs time.
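A minimal sketch of what that could look like, assuming a fixed pool of 4 reused easy handles and the same JSON decoding as in the question; the function name pooled_curl and the pool size are illustrative choices, not part of the original answer:

    function pooled_curl(array $urls, $pool_size = 4)
    {
        $results = array();
        $mh = curl_multi_init();

        // A small, fixed pool of easy handles that is reused for every batch,
        // so connections can be kept alive instead of opened fresh each time.
        $pool = array();
        for ($i = 0; $i < $pool_size; $i++) {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_HEADER, 0);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
            $pool[$i] = $ch;
        }

        // Process the URLs in batches no larger than the pool.
        foreach (array_chunk($urls, $pool_size, true) as $batch) {
            $in_flight = array();
            $i = 0;
            foreach ($batch as $k => $url) {
                $ch = $pool[$i++];
                curl_setopt($ch, CURLOPT_URL, $url); // same handle, new URL
                curl_multi_add_handle($mh, $ch);
                $in_flight[$k] = $ch;
            }

            // Drive the transfers; curl_multi_select() sleeps until there is
            // activity instead of spinning the CPU in a tight loop.
            do {
                $status = curl_multi_exec($mh, $running);
                if ($running) {
                    curl_multi_select($mh);
                }
            } while ($running > 0 && $status === CURLM_OK);

            // Collect the results (error checking as in the earlier snippet is
            // omitted for brevity) and detach the handles without closing them.
            foreach ($in_flight as $k => $ch) {
                $results[$k] = json_decode(curl_multi_getcontent($ch), TRUE);
                curl_multi_remove_handle($mh, $ch);
            }
        }

        foreach ($pool as $ch) {
            curl_close($ch);
        }
        curl_multi_close($mh);

        return $results;
    }

With a small pool, keep-alive connections can be reused across batches whenever consecutive URLs hit the same host, which is where most of the speedup comes from; URLs spread over many different hosts will still pay the connection-setup cost.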

answered Oct 29 '22 by Synopsis