Pulling data from API, memory growth

I'm working on a project where I pull data (JSON) from an API. The problem I'm having is that memory usage slowly grows until I get the dreaded fatal error:

Fatal error: Allowed memory size of * bytes exhausted (tried to allocate * bytes) in C:... on line *

I don't think there should be any memory growth. I tried unsetting everything at the end of the loop, but it made no difference. So my question is: am I doing something wrong? Is it normal? What can I do to fix this problem?

<?php

$start = microtime(true);

$time = microtime(true) - $start;
echo "Start: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";

include ('start.php');
include ('connect.php');

set_time_limit(0);

$api_key = 'API-KEY';
$tier = 'Platinum';
$threads = 10; //number of urls called simultaneously

function multiRequest($urls, $start) {

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;start function: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    $nbrURLS = count($urls); // number of urls in array $urls
    $ch = array(); // array of curl handles
    $result = array(); // data to be returned

    $mh = curl_multi_init(); // create a multi handle 

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;Creation multi handle: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    // set URL and other appropriate options
    for($i = 0; $i < $nbrURLS; $i++) {
        $ch[$i]=curl_init();

        curl_setopt($ch[$i], CURLOPT_URL, $urls[$i]);
        curl_setopt($ch[$i], CURLOPT_RETURNTRANSFER, 1); // return data as string
        curl_setopt($ch[$i], CURLOPT_SSL_VERIFYPEER, 0); // don't verify the SSL certificate

        curl_multi_add_handle ($mh, $ch[$i]); // Add a normal cURL handle to a cURL multi handle
    }

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;For loop options: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    // execute the handles
    do {
        $mrc = curl_multi_exec($mh, $active);          
        curl_multi_select($mh, 0.1); // without this, we will busy-loop here and use 100% CPU
    } while ($active);

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp;Execution: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    echo '&nbsp;&nbsp;&nbsp;For loop2<br>';

    // get content and remove handles
    for($i = 0; $i < $nbrURLS; $i++) {

        $error = curl_getinfo($ch[$i], CURLINFO_HTTP_CODE); // Last received HTTP code 

        echo "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;error: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

        //error handling if not 200 ok code
        if($error != 200){

            if($error == 429 || $error == 500 || $error == 503 || $error == 504){
                echo "Again error: $error<br>";
                $result['again'][] = $urls[$i];

            } else {
                echo "Error error: $error<br>";
                $result['errors'][] = array("Url" => $urls[$i], "errornbr" => $error);
            }

        } else {
            $result['json'][] = curl_multi_getcontent($ch[$i]);

            echo "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Content: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";
        }

        curl_multi_remove_handle($mh, $ch[$i]);
        curl_close($ch[$i]);
    }

    $time = microtime(true) - $start;
    echo "&nbsp;&nbsp;&nbsp; after loop2: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br>";

    curl_multi_close($mh);

    return $result;
}


$gamesId = mysqli_query($connect, "SELECT gameId FROM `games` WHERE `region` = 'EUW1' AND `tier` = '$tier' LIMIT 20");
$urls = array();

while($result = mysqli_fetch_array($gamesId))
{
    $urls[] = 'https://euw.api.pvp.net/api/lol/euw/v2.2/match/' . $result['gameId'] . '?includeTimeline=true&api_key=' . $api_key;
}

$time = microtime(true) - $start;
echo "After URL array: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";

$x = 1; //number of loops

while($urls){ 

    $chunk = array_splice($urls, 0, $threads); // take the first chunk ($threads) of all urls

    $time = microtime(true) - $start;
    echo "<br>After chunk: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";

    $result = multiRequest($chunk, $start); // Get json

    unset($chunk);

    $nbrComplete = count($result['json']); //number of returned json strings

    echo 'For loop: <br/>';

    for($y = 0; $y < $nbrComplete; $y++){
        // parse the json
        $decoded = json_decode($result['json'][$y], true);

        $time = microtime(true) - $start;
        echo "&nbsp;&nbsp;&nbsp;Decode: ". memory_get_peak_usage(true) . " | " . memory_get_usage() . "<br/>";


    }

    unset($nbrComplete);
    unset($decoded);

    $time = microtime(true) - $start;
    echo $x . ": ". memory_get_peak_usage(true) . " | " . $time . "<br>";

    // reuse urls
    if(isset($result['again'])){
        $urls = array_merge($urls, $result['again']);
        unset($result['again']);
    }

    unset($result);
    unset($time);

    sleep(15); // limit the request rate

    $x++;
}

include ('end.php');

?>

PHP Version 5.3.9 - 100 loops:

loop: memory | time (sec)
1: 5505024 | 0.98330211639404
3: 6291456 | 33.190237045288
65: 6553600 | 1032.1401019096
73: 6815744 | 1160.4345710278
75: 7077888 | 1192.6274609566
100: 7077888 | 1595.2397520542

EDIT:
After trying it with PHP 5.6.14 (XAMPP on Windows):

loop: memory | time (sec)
1: 5505024 | 1.0365679264069
3: 6291456 | 33.604479074478
60: 6553600 | 945.90159296989
62: 6815744 | 977.82566595078
93: 7077888 | 1474.5941500664
94: 7340032 | 1490.6698410511
100: 7340032 | 1587.2434458733

EDIT2: I only see the memory increase after json_decode

Start: 262144 | 135448
After URL array: 262144 | 151984
After chunk: 262144 | 152272
   start function: 262144 | 152464
   Creation multi handle: 262144 | 152816
   For loop options: 262144 | 161424
   Execution: 3145728 | 1943472
   For loop2
      error: 3145728 | 1943520
      Content: 3145728 | 2095056
      error: 3145728 | 1938952
      Content: 3145728 | 2131992
      error: 3145728 | 1938072
      Content: 3145728 | 2135424
      error: 3145728 | 1933288
      Content: 3145728 | 2062312
      error: 3145728 | 1928504
      Content: 3145728 | 2124360
      error: 3145728 | 1923720
      Content: 3145728 | 2089768
      error: 3145728 | 1918936
      Content: 3145728 | 2100768
      error: 3145728 | 1914152
      Content: 3145728 | 2089272
      error: 3145728 | 1909368
      Content: 3145728 | 2067184
      error: 3145728 | 1904616
      Content: 3145728 | 2102976
    after loop2: 3145728 | 1899824
For loop: 
   Decode: 3670016 | 2962208
   Decode: 4980736 | 3241232
   Decode: 5242880 | 3273808
   Decode: 5242880 | 2802024
   Decode: 5242880 | 3258152
   Decode: 5242880 | 3057816
   Decode: 5242880 | 3169160
   Decode: 5242880 | 3122360
   Decode: 5242880 | 3004216
   Decode: 5242880 | 3277304
asked Oct 27 '15 by PHPhil




2 Answers

Your function is quite long, so I believe that garbage collection won't get fired until the very end of it, which means your unused variables can build up. Once they aren't going to be used anymore, garbage collection would take care of them for you.

You might think about refactoring this code into smaller functions to take advantage of that, along with all the other good things that come with smaller functions. In the meantime, you could try putting gc_collect_cycles(); at the very end of your loop to see if you can free some memory:

if(isset($result['again'])){
    $urls = array_merge($urls, $result['again']);
    unset($result['again']);
}

unset($result);
unset($time);

gc_collect_cycles();//add this line here
sleep(15); // limit the request rate

Edit: the segment I updated doesn't actually belong to the big function, but I suspect the size of $result may be what tips things over, and it possibly won't get cleaned up until the loop terminates. It is worth a try in any case.
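
For illustration only, here is a rough sketch of that refactoring; processChunk() is a hypothetical helper name, not something from the original code. Keeping the per-chunk work in its own function means $result and $decoded go out of scope, and can be freed, as soon as each chunk is finished:

// Hypothetical processChunk() helper: every per-chunk variable is local,
// so it is released when the function returns.
function processChunk(array $chunk, $start) {
    $result = multiRequest($chunk, $start);

    if (isset($result['json'])) {
        foreach ($result['json'] as $json) {
            $decoded = json_decode($json, true);
            // ... store $decoded in the database here ...
        }
    }

    return isset($result['again']) ? $result['again'] : array();
}

while ($urls) {
    $chunk = array_splice($urls, 0, $threads);
    $urls  = array_merge($urls, processChunk($chunk, $start));

    gc_collect_cycles(); // optionally force a cycle collection between chunks
    sleep(15);           // limit the request rate
}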

answered by Paul Stanley


So my question is: am I doing something wrong? Is it normal? What can I do to fix this problem?

Yes, running out of memory is normal when you use all of it. You are making 10 simultaneous HTTP requests and deserializing the JSON responses into PHP memory. Without limiting the size of the responses, you will always be in danger of running out of memory.
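
As an illustration of what limiting the size of a response could look like, here is a minimal sketch using a cURL write callback; the 1 MB cap, the placeholder URL and the $body variable are assumptions, not part of the question's code. Returning fewer bytes than were passed in aborts the transfer:

// Sketch only: cap a single response at an assumed 1 MB so one oversized
// reply cannot exhaust memory. With a write callback, the body is collected
// manually instead of via CURLOPT_RETURNTRANSFER.
$maxBytes = 1024 * 1024;
$body     = '';

$ch = curl_init('https://example.com/big.json'); // placeholder URL
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $data) use (&$body, $maxBytes) {
    if (strlen($body) + strlen($data) > $maxBytes) {
        return 0; // aborts the transfer once the cap is exceeded
    }
    $body .= $data;
    return strlen($data);
});
curl_exec($ch);
curl_close($ch);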

What else can you do?

  1. Do not run multiple HTTP connections simultaneously. Turn $threads down to 1 to test this. If there is a memory leak in a C extension, calling gc_collect_cycles() will not free any memory; it only affects memory allocated in the Zend Engine which is no longer reachable.
  2. Save the results to a folder and process them in another script. You can move the processed files into a subdirectory to mark when you have successfully processed a JSON file (see the sketch after this list).
  3. Investigate forking or a message queue to have multiple processes work on a portion of the problem at the same time - either multiple PHP processes listening to a queue bucket, or forked children of the parent process with their own process memory.
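
A minimal sketch of option 2 (the folder names and the database step are placeholders, not from the original code): the download loop writes each raw response straight to disk, and a second script decodes the saved files one at a time, moving each file once it has been handled.

// Sketch only: decode previously saved responses one file at a time.
// json_incoming/ and json_processed/ are assumed directory names.
$incomingDir  = __DIR__ . '/json_incoming';
$processedDir = __DIR__ . '/json_processed';

foreach (glob($incomingDir . '/*.json') as $file) {
    $decoded = json_decode(file_get_contents($file), true);

    if ($decoded !== null) {
        // ... insert the decoded match data into the database here ...
        rename($file, $processedDir . '/' . basename($file)); // mark as processed
    }

    unset($decoded); // only one decoded response is ever held in memory at a time
}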
answered by chugadie