Running asynchronous queries in BigQuery not noticeably faster

I am using Google's Python API client library on App Engine to run a number of queries in BigQuery to generate live analytics. The calls take roughly two seconds each, and with five queries, that is too long, so I looked into ways to speed things up and thought running the queries asynchronously would be a solid improvement. The thinking was that I could insert all five queries at once, Google would do some magic to run them all at the same time, and I could then use jobs.getQueryResults(jobId) to get the results for each job. I tested the theory with a proof of concept, timing the execution of two asynchronous queries and comparing it to running the same queries synchronously. The results:

  • synchronous: 3.07 seconds (1.34s and 1.29s for each query)
  • asynchronous: 2.39 seconds (0.52s and 0.44s for each insert, plus another 1.09s for getQueryResults())

That's a difference of only 0.68 seconds. So while asynchronous queries are faster, they aren't delivering the Google parallel magic needed to cut down on total execution time. So, first question: is that expectation of parallel magic correct? Even if it's not, of particular interest to me is Google's claim that

An asynchronous query returns a response immediately, generally before the query completes.

Roughly half a second to insert the query doesn't meet my definition of 'immediately'! I imagine Jordan or someone else on the BigQuery team will be the only ones who can answer this, but I welcome any answers!

EDIT NOTES:

  1. Per Mikhail Berlyant's suggestion, I gathered creationTime, startTime and endTime from the jobs response (see the sketch after this list) and found:

    • creationTime to startTime: 462ms, 387ms (timings for queries 1 and 2)
    • startTime to endTime: 744ms, 1005ms

Though I'm not sure that adds anything to the story, as it's the time between issuing insert() and the call completing that I'm wondering about.

  2. From BQ's Jobs documentation, the answer to my first question about parallel magic is yes:

You can run multiple jobs concurrently in BigQuery
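
For anyone curious, here's roughly how I pulled those timestamps. This is a minimal sketch, assuming the same bigquery client and project_id as in my code below; the field names come from the statistics object in the Jobs API response:

    def print_job_timings(job_id):
        # Fetch the full job resource; statistics holds millisecond-epoch
        # timestamps for the job lifecycle.
        job = bigquery.jobs().get(projectId=project_id, jobId=job_id).execute()
        stats = job['statistics']
        creation = int(stats['creationTime'])
        start = int(stats['startTime'])
        end = int(stats['endTime'])
        print("creation -> start: " + str(start - creation) + "ms")
        print("start -> end: " + str(end - start) + "ms")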

CODE:

For what it's worth, I tested this both locally and on production App Engine. Local was slower by a factor of about 2-3 but replicated the results. In my research I also found out about partitioned tables, which I wish I had known about before (and which may well end up being my solution), but this question stands on its own. Here is my code; I am omitting the actual SQL because it is irrelevant in this case:

    # Assumes `bigquery` is an authorized BigQuery v2 service object
    # (from googleapiclient.discovery.build) and that `project_id`,
    # `sql`, and `time` are defined/imported elsewhere.

    def test_sync(self, request):
        t0 = time.time()

        jobs = bigquery.jobs()  # jobs collection from the BigQuery service
        data = {'query': sql}
        response = jobs.query(projectId=project_id, body=data).execute()
        t1 = time.time()

        data = {'query': sql}
        response = jobs.query(projectId=project_id, body=data).execute()
        t2 = time.time()

        print("0-1: " + str(t1 - t0))
        print("1-2: " + str(t2 - t1))
        print("elapsed: " + str(t2 - t0))

    def test_async(self, request):
        job_ids = {}

        t0 = time.time()
        job_id = async_query(sql)
        job_ids['a'] = job_id
        print("job_id: " + job_id)
        t1 = time.time()

        job_id = async_query(sql)
        job_ids['b'] = job_id
        print("job_id: " + job_id)
        t2 = time.time()

        # getQueryResults() blocks until each job completes (or times out)
        for key, value in job_ids.iteritems():  # Python 2, per App Engine standard
            response = bigquery.jobs().getQueryResults(
                jobId=value,
                projectId=project_id).execute()

        t3 = time.time()
        print("0-1: " + str(t1 - t0))
        print("1-2: " + str(t2 - t1))
        print("2-3: " + str(t3 - t2))
        print("elapsed: " + str(t3 - t0))

    def async_query(sql):
        job_data = {
            'jobReference': {
                'projectId': project_id
            },
            'configuration': {
                'query': {
                    'query': sql,
                    'priority': 'INTERACTIVE'
                }
            }
        }

        response = bigquery.jobs().insert(
            projectId=project_id,
            body=job_data).execute()
        job_id = response['jobReference']['jobId']

        return job_id
Asked Jul 14 '16 by Ryan DuVal

1 Answer

The answer to whether running queries in parallel will speed up the results is, of course, "it depends".

When you use the asynchronous job API, there is about half a second of built-in latency added to every query. This is because the API is not designed for short-running queries; if your queries run in under a second or two, you don't need asynchronous processing.

The half-second latency will likely go down in the future, but there are a number of fixed costs that aren't going to get any better. For example, you're sending two HTTP requests to Google instead of one. How long these take depends on where you are sending the requests from and the characteristics of the network you're using. If you're in the US, this could be only a few milliseconds of round-trip time, but if you're in Brazil, it might be 100 ms.

Moreover, when you use jobs.query(), the BigQuery API server that receives the request is the same one that starts the query, so it can return the results as soon as the query is done. But when you use the asynchronous API, your getQueryResults() request will go to a different server. That server has to either poll for the job state or find the server that is running the query in order to get its status. This takes time.

So if you're running a bunch of queries in parallel, where each one takes 1-2 seconds but you're adding half a second to each one, plus another half second for the initial request, you're not likely to see much speedup. If, on the other hand, your queries take 5 or 10 seconds each, the fixed overhead is a smaller percentage of the total time.
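
As a rough back-of-envelope (an illustrative model, not measured data) using numbers in the ballpark of the question's timings:

    # n short queries, each ~1s of actual execution
    n = 5
    work = 1.0      # seconds of query execution each
    insert = 0.5    # seconds per async insert (issued serially)
    fetch = 1.0     # seconds to collect results via getQueryResults()

    sync_elapsed = n * (work + 0.3)            # assume ~0.3s HTTP overhead per sync call
    async_elapsed = n * insert + work + fetch  # queries overlap once inserted

    print(sync_elapsed)   # 6.5
    print(async_elapsed)  # 4.5

With work = 5.0 the same model gives roughly 26.5s versus 8.5s, which is why longer queries benefit far more from running in parallel.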

My guess is that if you ran a larger number of queries in parallel, you'd see more of a speedup. The other option is to use the synchronous version of the API and send multiple requests in parallel from multiple threads on the client.
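
Here's a minimal sketch of that threaded approach, assuming the same bigquery service object and project_id from the question, hypothetical sql_1 through sql_5 strings, and concurrent.futures (standard library in Python 3; available via the futures backport on Python 2):

    from concurrent.futures import ThreadPoolExecutor

    def run_query(sql):
        # Synchronous jobs.query(): a single HTTP request that returns
        # once the query finishes (or the default timeout elapses).
        return bigquery.jobs().query(
            projectId=project_id,
            body={'query': sql}).execute()

    queries = [sql_1, sql_2, sql_3, sql_4, sql_5]  # hypothetical SQL strings

    # Each thread blocks on its own request, so all five queries are
    # running in BigQuery at the same time.
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        responses = list(pool.map(run_query, queries))

One caveat: the httplib2 transport used by the Python API client is not thread-safe, so in practice each thread should build its own service or http object rather than sharing one.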

There is one more caveat, and that is query size. Unless you purchase extra capacity, BigQuery will, by default, give you 2000 "slots" across all of your queries. A slot is a unit of work that can be done in parallel. You can use those 2000 slots to run one giant query, or 20 smaller queries that each use 100 slots at once. If you run parallel queries that saturate your 2000 slots, you'll experience a slowdown.

That said, 2000 slots is a lot. As a very rough estimate, 2000 slots can process hundreds of gigabytes per second. So unless you're pushing that kind of volume through BigQuery, adding parallel queries is unlikely to slow you down.

Answered by Jordan Tigani