
How do I request paginated BigQuery query results using pageTokens with the Google Client lib for Java?

I want to run BigQuery queries with thousands of rows of total results, but I only want to retrieve a page of 100 results at a time (using the maxResults and pageToken parameters).

The BigQuery API supports the use of pageToken parameters on collection.list methods. However, I am running asynchronous queries and retrieving the results using the getQueryResults method, and it doesn't seem to support the pageToken parameter. Is it possible to use pageTokens with getQueryResults?

Michael Manoochehri asked Feb 11 '13 07:02



1 Answer

Update: There's new documentation about how to page through list results here.

I am self-answering this question, because a developer asked me this privately and I want to share the answer on Stack Overflow.

The pageToken parameter is available when requesting paginated results from the tabledata.list method. Result sets are paginated automatically when, for example, the result data exceeds 100k rows or 10 MB. You can also request pagination explicitly by setting the maxResults parameter. Each page of results returns a pageToken, which can then be used to retrieve the next page of results.

Every query results in a new BigQuery table. If you don't name the table explicitly, it only lasts for 24 hours. However, even unnamed "anonymous" tables have an identifier. In either case, after inserting a query job, retrieve the name of the newly created table. Then use the tabledata.list method (with a combination of the maxResults and pageToken parameters) to request results in paginated form. Loop, calling tabledata.list with the previously retrieved pageToken, until a pageToken is no longer returned (meaning that you have reached the last page).
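To make the token loop concrete without needing credentials, here is a minimal, self-contained sketch that simulates it in plain Java. Nothing in it is part of the BigQuery client library: PageTokenDemo, Page, listPage and fetchAllPages are hypothetical names, and encoding the token as a row offset is purely an assumption for illustration (real page tokens are opaque strings). It only shows the control flow: pass null for the first page, feed each returned token into the next call, and stop when no token comes back.

```java
import java.util.ArrayList;
import java.util.List;

class PageTokenDemo {

    /** One page of rows plus the token for the next page (null on the last page). */
    static final class Page {
        final List<String> rows;
        final String nextPageToken;

        Page(List<String> rows, String nextPageToken) {
            this.rows = rows;
            this.nextPageToken = nextPageToken;
        }
    }

    /**
     * Hypothetical in-memory stand-in for a paged list call.
     * ASSUMPTION: the token is simply the start offset encoded as a string.
     */
    static Page listPage(List<String> table, long maxResults, String pageToken) {
        int start = (pageToken == null) ? 0 : Integer.parseInt(pageToken);
        int end = (int) Math.min(table.size(), start + maxResults);
        List<String> rows = new ArrayList<>(table.subList(start, end));
        // Only hand back a token when there are rows left to fetch
        String next = (end < table.size()) ? String.valueOf(end) : null;
        return new Page(rows, next);
    }

    /** Requests page after page until no token is returned. */
    static List<List<String>> fetchAllPages(List<String> table, long maxResults) {
        List<List<String>> pages = new ArrayList<>();
        String pageToken = null; // null requests the first page
        do {
            Page page = listPage(table, maxResults, pageToken);
            pages.add(page.rows);
            pageToken = page.nextPageToken;
        } while (pageToken != null);
        return pages;
    }

    public static void main(String[] args) {
        List<String> table = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            table.add("row-" + i);
        }
        List<List<String>> pages = fetchAllPages(table, 20);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page #" + (i + 1) + ": " + pages.get(i).size() + " rows");
        }
    }
}
```

With 45 simulated rows and a page size of 20, this reports three pages of 20, 20 and 5 rows.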

Using the Google API Client library for Java, the code for inserting a query job, polling for query completion, and then retrieving page after page of query results might look something like this:

// Create a new BigQuery client authorized via OAuth 2.0 protocol
// See: https://developers.google.com/bigquery/docs/authorization#installed-applications
Bigquery bigquery = createAuthorizedClient();

// Start a Query Job
String querySql = "SELECT TOP(word, 500), COUNT(*) FROM publicdata:samples.shakespeare";
JobReference jobId = startQuery(bigquery, PROJECT_ID, querySql);

// Poll for Query Results, return result output
TableReference completedJob = checkQueryResults(bigquery, PROJECT_ID, jobId);

// Return and display the results of the Query Job
displayQueryResults(bigquery, completedJob);

/**
 * Inserts a Query Job for a particular query
 */
public static JobReference startQuery(Bigquery bigquery, String projectId,
                                      String querySql) throws IOException {
  System.out.format("\nInserting Query Job: %s\n", querySql);

  Job job = new Job();
  JobConfiguration config = new JobConfiguration();
  JobConfigurationQuery queryConfig = new JobConfigurationQuery();
  config.setQuery(queryConfig);

  job.setConfiguration(config);
  queryConfig.setQuery(querySql);

  Insert insert = bigquery.jobs().insert(projectId, job);
  insert.setProjectId(projectId);
  JobReference jobId = insert.execute().getJobReference();

  System.out.format("\nJob ID of Query Job is: %s\n", jobId.getJobId());

  return jobId;
}

/**
 * Polls the status of a BigQuery job, returns TableReference to results if "DONE"
 */
private static TableReference checkQueryResults(Bigquery bigquery, String projectId, JobReference jobId)
    throws IOException, InterruptedException {
  // Variables to keep track of total query time
  long startTime = System.currentTimeMillis();
  long elapsedTime;

  while (true) {
    Job pollJob = bigquery.jobs().get(projectId, jobId.getJobId()).execute();
    elapsedTime = System.currentTimeMillis() - startTime;
    System.out.format("Job status (%dms) %s: %s\n", elapsedTime,
        jobId.getJobId(), pollJob.getStatus().getState());
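    if (pollJob.getStatus().getState().equals("DONE")) {
      // A job can be "DONE" and still have failed, so check for an error first
      if (pollJob.getStatus().getErrorResult() != null) {
        throw new IOException("Query job failed: "
            + pollJob.getStatus().getErrorResult().getMessage());
      }
      return pollJob.getConfiguration().getQuery().getDestinationTable();
    }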
    // Pause execution for one second before polling job status again, to
    // reduce unnecessary calls to the BigQuery API and lower overall
    // application bandwidth.
    Thread.sleep(1000);
  }
}

/**
 * Page through the result set
 */
private static void displayQueryResults(Bigquery bigquery,
                                        TableReference completedJob) throws IOException {
  long maxResults = 20;
  String pageToken = null;
  int page = 1;

  do {
    // Request one page of rows from the query's destination table
    TableDataList queryResult = bigquery.tabledata().list(
            completedJob.getProjectId(),
            completedJob.getDatasetId(),
            completedJob.getTableId())
        .setMaxResults(maxResults)
        .setPageToken(pageToken)
        .execute();

    System.out.print("\nQuery Results, Page #" + page + ":\n------------\n");
    // getRows() may return null when the page contains no rows
    List<TableRow> rows = queryResult.getRows();
    if (rows != null) {
      for (TableRow row : rows) {
        for (TableCell field : row.getF()) {
          System.out.printf("%-50s", field.getV());
        }
        System.out.println();
      }
    }

    // A null pageToken means this was the last page
    pageToken = queryResult.getPageToken();
    page++;
  } while (pageToken != null);
}
Michael Manoochehri answered Oct 07 '22 17:10
