Would it be possible to run a Python function in BigQuery?
It seems like C can be compiled to WebAssembly and run in BQ, per this blog post from Felipe.
And of course Python can be compiled to C or C++ using Cython or similar tools (or it could even be transpiled to JavaScript). So my question is: does anyone have experience executing a Python function in BigQuery? If so, what flow are you using to do it?
Possible options here are compiling the Python function to C with Cython and then to WebAssembly, or transpiling it to JavaScript.
Here is an example input to work with:
(1) Source
id    product
1     box
2     bottle
(2) Python functions to use
def double_id(row):
    return row['id'] * 2

def product_code(row):
    # e.g. 'box' -> 'B3'
    return row['product'].upper()[0] + str(len(row['product']))
(3) Expected output
id    product    double_id    product_code
1     box        2            B3
2     bottle     4            B6
I'm not just looking to rewrite the above in JavaScript (which would probably be the easiest way to do this) -- I'm looking for a more generalized solution, if one exists: a way to take a Python (standard-library) function and use it in a BigQuery query.
After creating a persistent UDF, you can call it as you would any other function, prefixed with the name of the dataset in which it is defined. To call a UDF in a project other than the one you are using to run the query, the project name is also required as a prefix.
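As a sketch, here is how you might call such a persistent UDF from Python through the BigQuery client library (the project, dataset, table, and UDF names below are all placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Call a persistent UDF, qualified with the dataset (and, if it lives in a
# different project, the project name) where it is defined.
sql = """
SELECT mydataset.my_udf(id) AS my_udf_result
FROM `my_project.mydataset.my_table`
"""
for row in client.query(sql).result():
    print(row.my_udf_result)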
The BigQuery client library for Python is automatically installed in a managed notebook. Behind the scenes, the %%bigquery magic command uses the BigQuery client library for Python to run the given query, convert the results to a pandas DataFrame, optionally save the results to a variable, and then display the results.
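A minimal sketch of that flow (the table name is a placeholder, and managed notebooks may already have the extension loaded): first load the magic in one cell:

%load_ext google.cloud.bigquery

Then, in a separate cell, run the query and save the result to a pandas DataFrame named df:

%%bigquery df
SELECT id, product
FROM `my_project.mydataset.products`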
Python 3 Apache Beam + BigQuery

Here's the key Beam code to read from BigQuery and write to BigQuery:
with beam.Pipeline(RUNNER, options=opts) as p:
    (p
     | 'read_bq' >> beam.io.Read(beam.io.BigQuerySource(query=query, use_standard_sql=True))
     | 'compute_fit' >> beam.FlatMap(compute_fit)
     | 'write_bq' >> beam.io.gcp.bigquery.WriteToBigQuery(
         'ch05eu.station_stats',
         schema='station_id:string,ag:FLOAT64,bg:FLOAT64,cg:FLOAT64')
    )
Essentially, we are running a query on a BigQuery table, running the Python method compute_fit on each row, and writing the output to another BigQuery table. Here is my compute_fit method. As you can see, it's just plain Python code:
def compute_fit(row):
    from scipy import stats
    import numpy as np
    durations = row['duration_array']
    ag, bg, cg = stats.gamma.fit(durations)
    if np.isfinite(ag) and np.isfinite(bg) and np.isfinite(cg):
        result = {}
        result['station_id'] = str(row['start_station_id'])
        result['ag'] = ag
        result['bg'] = bg
        result['cg'] = cg
        yield result
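The same pattern would fit the question's example. A sketch of a FlatMap function that adds the double_id and product_code columns (the function name add_codes is mine; the output table and schema in WriteToBigQuery would change accordingly):

def add_codes(row):
    # Emit the input columns plus the two derived columns from the question.
    yield {
        'id': row['id'],
        'product': row['product'],
        'double_id': row['id'] * 2,
        'product_code': row['product'].upper()[0] + str(len(row['product'])),
    }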
Make sure to specify the Python packages that you need installed on the Dataflow workers in a requirements.txt:
%%writefile requirements.txt
numpy
scipy
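The pipeline then needs to be pointed at that file. A minimal sketch, assuming the file sits in the local working directory, using Beam's requirements_file pipeline option:

from apache_beam.options.pipeline_options import PipelineOptions

# Tell Dataflow to install the packages from requirements.txt on each worker.
opts = PipelineOptions(flags=[], requirements_file='requirements.txt')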
Enjoy! For more info, you can refer to this document: How to run Python code on your BigQuery table.