
Running a python function in BigQuery

Would it be possible to run a python function in BigQuery?

It seems like C can be compiled to WebAssembly and run in BQ, per this blog post from Felipe.

And of course Python can be compiled to C or C++ using Cython or similar tools (or it could even be transpiled to JavaScript). So my question is: does anyone have experience executing a Python function in BigQuery? If so, what flow are you using to do it?

Possible options here are:

  • "Transform" the Python into JavaScript and run that.
  • Compile the Python into C or C++, and compile that into WASM.

Here is an example input to work with:

(1) Source

id    product
1     box
2     bottle

(2) Python functions to use

def double_id(row):
    return row['id'] * 2

def product_code(row):
    # e.g. 'box' -> 'B3'
    return row['product'].upper()[0] + str(len(row['product']))

(3) Expected output

id    product    double_id    product_code
1     box        2            B3
2     bottle     4            B6

I'm not just looking to re-write the above using javascript (which would probably be the easiest way to do this), but I'm looking for a more generalized solution, if there is one that exists -- where I can take a python (standard library) function and use it in a BigQuery query.
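For reference, the expected output above can be reproduced locally with plain Python — a quick sketch that applies the two functions from the example to the source rows:

```python
# Apply the example's two row functions locally to reproduce
# the expected output table.
def double_id(row):
    return row['id'] * 2

def product_code(row):
    # e.g. 'box' -> 'B3'
    return row['product'].upper()[0] + str(len(row['product']))

rows = [{'id': 1, 'product': 'box'},
        {'id': 2, 'product': 'bottle'}]

for row in rows:
    row['double_id'] = double_id(row)
    row['product_code'] = product_code(row)
```

The question, of course, is how to get this same per-row computation to run inside a BigQuery query rather than on a local machine.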

asked Apr 01 '19 by David542

People also ask

How do you call a function in BigQuery?

After creating a persistent UDF, you can call it like any other function, prefixed with the name of the dataset in which it is defined. To call a UDF in a project other than the one you are using to run the query, the project_name is also required.
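As an illustrative sketch (the dataset and UDF names here are hypothetical), the dataset prefix ends up in the query text like this:

```python
# Hypothetical dataset/UDF names, just to show the
# dataset-qualified call syntax described above.
dataset = "mydataset"        # dataset where the persistent UDF lives
udf = "multiplyInputs"       # hypothetical persistent UDF name
query = (f"SELECT {dataset}.{udf}(x, y) AS product "
         f"FROM UNNEST([STRUCT(1 AS x, 2 AS y)])")
```

Cross-project calls would additionally prefix the project name, e.g. `project_name.dataset.function(...)`.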

Can Python connect to BigQuery?

The BigQuery client library for Python is automatically installed in a managed notebook. Behind the scenes, the %%bigquery magic command uses the BigQuery client library for Python to run the given query, convert the results to a pandas DataFrame, optionally save the results to a variable, and then display the results.


1 Answer

Python 3 Apache Beam + BigQuery

Here's the key Beam code to read from BigQuery and write back to BigQuery:

with beam.Pipeline(RUNNER, options=opts) as p:
    (p
     | 'read_bq' >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
     | 'compute_fit' >> beam.FlatMap(compute_fit)
     | 'write_bq' >> beam.io.gcp.bigquery.WriteToBigQuery(
         'ch05eu.station_stats',
         schema='station_id:string,ag:FLOAT64,bg:FLOAT64,cg:FLOAT64')
    )

Essentially, we are running a query on a BigQuery table, running the Python method compute_fit, and writing the output to a BigQuery table. This is my compute_fit method. As you can see, it’s just plain Python code:

def compute_fit(row):
    from scipy import stats
    import numpy as np
    durations = row['duration_array']
    ag, bg, cg = stats.gamma.fit(durations)
    if np.isfinite(ag) and np.isfinite(bg) and np.isfinite(cg):
        result = {}
        result['station_id'] = str(row['start_station_id'])
        result['ag'] = ag
        result['bg'] = bg
        result['cg'] = cg
        yield result
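To see why this works, note that a Beam FlatMap callable is just an ordinary Python generator: it takes one input row (a dict) and yields zero or more output rows. Here is a scipy-free stand-in (the gamma fit is replaced by a simple mean, purely to keep the sketch self-contained) that can be exercised outside Beam:

```python
# Plain-Python illustration of the FlatMap pattern used by compute_fit:
# one input row in, zero or more output rows yielded.
import statistics

def compute_stats(row):  # hypothetical simplified analogue of compute_fit
    durations = row['duration_array']
    if durations:  # yield nothing for empty input, like the finiteness guard
        yield {'station_id': str(row['start_station_id']),
               'mean': statistics.mean(durations)}

out = list(compute_stats({'start_station_id': 42,
                          'duration_array': [300, 420, 600]}))
```

Because the callable is plain Python, it can be unit-tested locally before being handed to `beam.FlatMap`.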

Make sure to specify the Python packages that you need installed on the Dataflow workers in a requirements.txt:

%%writefile requirements.txt
numpy
scipy

Enjoy! For more information, refer to the document "How to run Python code on your BigQuery table".

answered Oct 17 '22 by Kais Tounsi