Would it be possible to run a Python function in BigQuery?
It seems like C can be compiled to WebAssembly and run in BQ, per this blog post from Felipe.
And of course Python can be compiled to C or C++ using Cython or similar tools (or it could even be transpiled to JavaScript). So my question is: does anyone have experience executing a Python function in BigQuery? If so, what flow are you using to do it?
Possible options here are compiling the Python function to C with Cython and then to WebAssembly, or transpiling it to JavaScript.
Here is an example input to work with:
(1) Source
id    product
1     box
2     bottle
(2) Python functions to use
def double_id(row):
    return row['id'] * 2

def product_code(row):
    # e.g. 'box' -> 'B3'
    return row['product'].upper()[0] + str(len(row['product']))
(3) Expected output
id    product    double_id    product_code
1     box        2            B3
2     bottle     4            B6
I'm not just looking to rewrite the above in JavaScript (which would probably be the easiest way to do this) -- I'm looking for a more generalized solution, if one exists: a way to take a Python (standard-library) function and use it in a BigQuery query.
After creating a persistent UDF, you can call it as you would any other function, prefixed with the name of the dataset in which it is defined. To call a UDF in a project other than the one you are using to run the query, the project name is also required as a prefix.
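As a sketch, here is how you might call such a persistent UDF from Python through the BigQuery client library (the project, dataset, table, and UDF names below are all placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Call a persistent UDF, qualified with the dataset (and, if it lives in a
# different project, the project name) where it is defined.
sql = """
SELECT mydataset.my_udf(id) AS my_udf_result
FROM `my_project.mydataset.my_table`
"""
for row in client.query(sql).result():
    print(row.my_udf_result)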
The BigQuery client library for Python is automatically installed in a managed notebook. Behind the scenes, the %%bigquery magic command uses the BigQuery client library for Python to run the given query, convert the results to a pandas DataFrame, optionally save the results to a variable, and then display the results.
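A minimal sketch of that flow (the table name is a placeholder, and managed notebooks may already have the extension loaded): first load the magic in one cell:

%load_ext google.cloud.bigquery

Then, in a separate cell, run the query and save the result to a pandas DataFrame named df:

%%bigquery df
SELECT id, product
FROM `my_project.mydataset.products`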
Python 3 Apache Beam + BigQuery

Here's the key Beam code to read from BigQuery and write to BigQuery:
with beam.Pipeline(RUNNER, options=opts) as p:
    (p
     | 'read_bq' >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
     | 'compute_fit' >> beam.FlatMap(compute_fit)
     | 'write_bq' >> beam.io.gcp.bigquery.WriteToBigQuery(
         'ch05eu.station_stats',
         schema='station_id:string,ag:FLOAT64,bg:FLOAT64,cg:FLOAT64')
    )
Essentially, we are running a query on a BigQuery table, running the Python method compute_fit on each row, and writing the output to another BigQuery table. Here is my compute_fit method. As you can see, it's just plain Python code:
def compute_fit(row):
    from scipy import stats
    import numpy as np
    durations = row['duration_array']
    ag, bg, cg = stats.gamma.fit(durations)
    if np.isfinite(ag) and np.isfinite(bg) and np.isfinite(cg):
        result = {}
        result['station_id'] = str(row['start_station_id'])
        result['ag'] = ag
        result['bg'] = bg
        result['cg'] = cg
        yield result
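The same pattern would fit the question's example. A sketch of a FlatMap function that adds the double_id and product_code columns (the function name add_codes is mine; the output table and schema in WriteToBigQuery would change accordingly):

def add_codes(row):
    # Emit the input columns plus the two derived columns from the question.
    yield {
        'id': row['id'],
        'product': row['product'],
        'double_id': row['id'] * 2,
        'product_code': row['product'].upper()[0] + str(len(row['product'])),
    }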
Make sure to specify the Python packages that you need installed on the Dataflow workers in a requirements.txt:
%%writefile requirements.txt
numpy
scipy
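The pipeline then needs to be pointed at that file. A minimal sketch, assuming the file sits in the local working directory, using Beam's requirements_file pipeline option:

from apache_beam.options.pipeline_options import PipelineOptions

# Tell Dataflow to install the packages from requirements.txt on each worker.
opts = PipelineOptions(flags=[], requirements_file='requirements.txt')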
Enjoy! For more info, you can refer to this document: How to run Python code on your BigQuery table.