Random Sampling in Google BigQuery

Q: How do I select a random sample in SQL?

To get a single random row, order the rows randomly with an ORDER BY clause and keep only one with LIMIT 1. The pattern is the same as in MySQL; in dialects that name the function RANDOM() (PostgreSQL and SQLite, for example), replace RAND() with RANDOM().

Q: What is BigQuery not good for?

BigQuery is not a substitute for a transactional relational database. It is oriented toward running analytical queries, not simple CRUD operations and lookups.

Tags:

google-cloud-platform

google-bigquery

I just discovered that the RAND() function, while undocumented, works in BigQuery. I was able to generate a (seemingly) random sample of 10 words from the Shakespeare dataset using:

SELECT word FROM (SELECT rand() as random,word FROM [publicdata:samples.shakespeare] ORDER BY random) LIMIT 10

My question is: Are there any disadvantages to using this approach instead of the HASH() method defined in the "Advanced examples" section of the reference manual? https://developers.google.com/bigquery/query-reference

236

asked Apr 29 '14 21:04

David M Smith

2 Answers

Great to know RAND() is available!

In my case I needed a predefined sample size. Instead of having to know the total number of rows and divide the sample size by it, I'm using the following query:

SELECT word, rand(5) as rand
FROM [publicdata:samples.shakespeare]
order by rand
#Sample size needed = 10
limit 10

To summarize, I use ORDER BY + LIMIT to randomize the rows and then extract a defined number of samples.
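The ORDER BY + LIMIT pattern can be illustrated outside BigQuery. The following is a minimal Python sketch, not BigQuery code: the `words` list is a hypothetical stand-in for the table, and `sample_by_sort` is a name made up for this example. It attaches a random key to every row, sorts by that key, and keeps the first n rows, so the sample size is always exactly n.

```python
import random

def sample_by_sort(rows, n, seed=None):
    """Attach a random key to every row, sort by it, keep the first n."""
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]  # like SELECT rand(), word
    keyed.sort(key=lambda pair: pair[0])           # like ORDER BY rand
    return [row for _, row in keyed[:n]]           # like LIMIT n

# Hypothetical stand-in for the table: 1000 distinct "words".
words = [f"word{i}" for i in range(1000)]
sample = sample_by_sort(words, 10, seed=5)
print(len(sample))
```

Note that every row gets a key and the whole list is sorted before anything is discarded, which is the cost that grows with table size.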

answered Oct 07 '22 15:10

fernandosjp

For stratified sampling, check https://stackoverflow.com/a/52901452/132438

Good job finding it :). I requested the function recently, but it hasn't made it to documentation yet.

I would say the advantage of RAND() is that the results will vary, while HASH() will keep giving you the same results for the same values (not guaranteed over time, but you get the idea).

In case you want the variability that RAND() brings while still getting consistent results - you can seed it with an integer, as in RAND(3).
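To make the difference concrete, here is a small Python sketch under stated assumptions: MD5 stands in for a stable hash (it is not BigQuery's HASH() function), and `hash_sample` and `random_sample` are hypothetical helper names. A hash-based filter picks the same rows every time because the decision depends only on the value, while a random filter picks the same rows only when it is given the same seed.

```python
import hashlib
import random

def hash_bucket(value, buckets=10):
    """Deterministic bucket for a value, using MD5 as a stand-in stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def hash_sample(rows, buckets=10):
    """Keep rows whose hash lands in bucket 0: the same rows on every run."""
    return [r for r in rows if hash_bucket(r, buckets) == 0]

def random_sample(rows, fraction, seed=None):
    """Keep each row with probability `fraction`; repeatable only if seeded."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

# Hypothetical stand-in for the table.
words = [f"word{i}" for i in range(1000)]

same_a = hash_sample(words)
same_b = hash_sample(words)
seeded_a = random_sample(words, 0.1, seed=3)
seeded_b = random_sample(words, 0.1, seed=3)
print(same_a == same_b, seeded_a == seeded_b)  # prints: True True
```

Changing the seed gives a different but still repeatable sample, whereas the hash-based sample can only be changed by picking a different bucket or hash.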

Notice though that the example you pasted is doing a full sort of the random values - for sufficiently big inputs this approach won't scale.

A scalable approach, to get around 10 random rows:

SELECT word FROM [publicdata:samples.shakespeare] WHERE RAND() < 10/164656

(where 10 is the approximate number of results I want to get, and 164656 the number of rows that table has)
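The effect of this filter can be simulated in Python. This is a rough sketch, not BigQuery code: `sample_by_filter` is a hypothetical helper and the rows are just integers. Each row is kept independently with probability target/total, so no sort is needed, but the number of rows returned is only approximately the target and varies from run to run.

```python
import random

def sample_by_filter(rows, target, total, seed=None):
    """Keep each row independently with probability target / total."""
    rng = random.Random(seed)
    p = target / total
    return [row for row in rows if rng.random() < p]  # like WHERE RAND() < p

total_rows = 164656          # row count quoted in the answer
rows = range(total_rows)     # integers standing in for table rows

# Repeat the filter with different seeds to see how the sample size varies.
sizes = [len(sample_by_filter(rows, 10, total_rows, seed=s)) for s in range(30)]
mean_size = sum(sizes) / len(sizes)
print(min(sizes), max(sizes), round(mean_size, 1))
```

If an exact sample size matters, one option is to oversample slightly with this filter and then apply a LIMIT to the much smaller result.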

standardSQL update:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/164656

or even:

#standardSQL
SELECT word
FROM `publicdata.samples.shakespeare`
WHERE RAND() < 10/(SELECT COUNT(*) FROM `publicdata.samples.shakespeare`)

101

answered Oct 07 '22 16:10

Felipe Hoffa
