
BigQuery PHP API - large query result memory bloat - even with paging

I run a range of queries in BigQuery and export the results to CSV via PHP. This is the easiest approach for me because I have multiple queries that depend on variables within an app.

I am struggling with memory issues when the result set is larger than 100 MB. The memory usage of my code appears to grow in line with the result set, which I thought paging would avoid. Here is my code:

$query = $bq->query($myQuery);
$queryResults = $bq->runQuery($query,['maxResults'=>5000]);

$FH = fopen($storagepath, 'w');

$rows = $queryResults->rows();

foreach ($rows as $row) {
    fputcsv($FH, $row);
}

fclose($FH);

The $queryResults->rows() function returns a Google Iterator which uses paging to scroll through the results, so I do not understand why memory usage grows as the script runs.

Am I missing a way to discard previous pages from memory as I page through the results?

UPDATE

I have noticed that since upgrading to v1.4.3 of the BigQuery PHP API, memory usage caps out at 120 MB for this process, even when the result set is far larger (I am currently processing a 1 GB result set). Still, 120 MB seems too much. How can I identify and fix where this memory is being used?

UPDATE 2

This 120 MB appears to scale at about 24 KB per row of maxResults: adding 1,000 rows to maxResults adds 24 MB of memory. So my question is now: why does one row of data use 24 KB in the Google Iterator? Is there a way to reduce this? The data itself is < 1 KB per row.
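One way to check whether memory really plateaus at the page size is to sample memory_get_usage() at fixed row intervals while streaming to the file. The following is only a sketch: fakeRows() is a hypothetical stand-in for $queryResults->rows(), and exportWithMemorySamples() is an illustrative helper, not part of the BigQuery library.

```php
<?php
// Hypothetical stand-in for $queryResults->rows(): yields one row at a time,
// the way a paged iterator would.
function fakeRows(int $total): Generator
{
    for ($i = 0; $i < $total; $i++) {
        yield ['id' => $i, 'name' => "row$i"];
    }
}

// Stream rows to a CSV handle and record memory usage every $sampleEvery rows.
// Returns an array of [rowCount => bytes in use].
function exportWithMemorySamples(iterable $rows, $handle, int $sampleEvery): array
{
    $samples = [];
    $count = 0;
    foreach ($rows as $row) {
        // Separator, enclosure and escape are passed explicitly so the call
        // behaves the same across PHP versions.
        fputcsv($handle, $row, ',', '"', '\\');
        $count++;
        if ($count % $sampleEvery === 0) {
            $samples[$count] = memory_get_usage();
        }
    }
    return $samples;
}

$fh = fopen('php://temp', 'w+');
$samples = exportWithMemorySamples(fakeRows(10000), $fh, 5000);
```

If the samples stay roughly flat as the row count grows, the iterator is releasing old pages and the remaining cost is per-page overhead. If they climb steadily, something is holding on to earlier rows.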

lufc asked May 17 '26 05:05


1 Answer

Answering my own question

The extra memory is used by PHP type mapping and other data-structure information that comes alongside the data from BigQuery. Unfortunately I couldn't find a way to reduce the memory usage below around 24 KB per row multiplied by the page size. If someone finds a way to reduce the bloat that comes along with the data, please post below.

However, thanks to one of the comments, I realized you can extract a query result directly to CSV in a Google Cloud Storage bucket. This is really easy:

$query = $bq->query($myQuery);

$queryResults = $bq->runQuery($query);

$qJobInfo = $queryResults->job()->info();

$dataset = $bq->dataset($qJobInfo['configuration']['query']['destinationTable']['datasetId']);

$table = $dataset->table($qJobInfo['configuration']['query']['destinationTable']['tableId']);

$extractJob = $table->extract('gs://mybucket/'.$filename.'.csv');

$table->runJob($extractJob);

However, this still didn't solve my issue because my result set was over 1 GB, which is more than BigQuery will export to a single file. I had to use the sharding feature by adding a wildcard to the destination URI.

$extractJob = $table->extract('gs://mybucket/'.$filename.'*.csv');

This created ~100 shards in the bucket. These need to be recombined using gsutil compose <shard filenames> <final filename>. However, gsutil only lets you compose 32 files at a time. Given that I will have variable numbers of shards, often above 32, I had to write some code to clean them up.

//Save above job as variable
$eJob = $table->runJob($extractJob);

$eJobInfo = $eJob->info();

//This bit of info from the job tells you how many shards were created
$eJobFiles = $eJobInfo['statistics']['extract']['destinationUriFileCounts'][0];

$composedFiles = 0; $composeLength = 0; $subfile = 0; $fileString = "";

while (($composedFiles < $eJobFiles) && ($eJobFiles>1)) {

    while (($composeLength < 32) && ($composedFiles < $eJobFiles)) {
        // Shards are named with a 12-digit zero-padded number after the filename, so build a string of up to 32 such filenames at a time
        $fileString .= "gs://bucket/$filename" . str_pad($composedFiles,12,"0",STR_PAD_LEFT) . ".csv ";

        $composedFiles++;

        $composeLength++;

    }

    $composeLength = 0;

    // Compose a batch of 32 into a subfile
    system("gsutil compose $fileString gs://bucket/".$filename."-".$subfile.".csv");

    $subfile++;

    $fileString="";
}

if ($eJobFiles > 1) {
    //Compose all the subfiles                        
    system('gsutil compose gs://bucket/'.$filename.'-* gs://fm-sparkbeyond/YouTube_1_0/' . $filepath . '.gz');
}
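The batching above can also be planned up front with array_chunk, which copes with more than 32 intermediate files by adding further compose levels. This is only a sketch under assumptions: shardUris(), planCompose() and composeCommand() are hypothetical helpers, the "-tmp-<level>-<n>.csv" naming for intermediate objects is made up for illustration, and gs://mybucket/export is a placeholder prefix. It only builds the command strings; running them is left to system() as before.

```php
<?php
// Build the shard URIs written for a wildcard extract: the "*" is replaced
// by a 12-digit zero-padded counter starting at 0.
function shardUris(string $prefix, int $count): array
{
    $uris = [];
    for ($i = 0; $i < $count; $i++) {
        $uris[] = sprintf('%s%012d.csv', $prefix, $i);
    }
    return $uris;
}

// Plan compose steps so that no step has more than $limit sources.
// Each step is ['sources' => string[], 'dest' => string].
// Intermediate object names are a hypothetical scheme.
function planCompose(array $sources, string $tmpPrefix, string $final, int $limit = 32): array
{
    $steps = [];
    $level = 0;
    while (count($sources) > $limit) {
        $next = [];
        foreach (array_chunk($sources, $limit) as $n => $batch) {
            $dest = sprintf('%s-tmp-%d-%d.csv', $tmpPrefix, $level, $n);
            $steps[] = ['sources' => $batch, 'dest' => $dest];
            $next[] = $dest;
        }
        $sources = $next; // the next level composes the intermediates
        $level++;
    }
    $steps[] = ['sources' => $sources, 'dest' => $final];
    return $steps;
}

// Render one step as a shell-escaped gsutil command line.
function composeCommand(array $step): string
{
    $args = array_map('escapeshellarg', array_merge($step['sources'], [$step['dest']]));
    return 'gsutil compose ' . implode(' ', $args);
}

$steps = planCompose(
    shardUris('gs://mybucket/export', 100),
    'gs://mybucket/export',
    'gs://mybucket/export.csv'
);
$commands = array_map('composeCommand', $steps); // pass each to system() to run it
```

For 100 shards this plans four batches (32, 32, 32 and 4 shards) followed by one final compose of the four intermediates. The intermediate objects would still need deleting afterwards.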

Note that in order to give my Apache user access to gsutil, I had to allow the user to create a .config directory in the web root. Ideally you would use the Google Cloud Storage PHP client library instead of shelling out to gsutil, but I didn't want the code bloat.

If anyone has a better answer, please post it:

  1. Is there a way to get smaller output from the BigQuery library than 24kb per row?

  2. Is there a more efficient way to clean up variable numbers of shards?

lufc answered May 18 '26 22:05


