
How does RAND() work in BigQuery?

I am trying to find the best sampling practice in BigQuery. My dataset is quite big (11B rows), and its distribution tends to be skewed. So far I've been exploring these two options:

  1. Hashing: I take the hash of a certain value to select the sample. This is a pretty straightforward approach, and the mechanics behind it are clear. My question is about the second option:
  2. Using the RAND() function. I understand how to use it from the BigQuery reference here: https://cloud.google.com/bigquery/docs/reference/legacy-sql#rand However, I have no idea how exactly this function works.

Can anyone shed some more light on what is happening behind the scenes?

Thanks a lot, Gallory

Gallory Knox asked Feb 08 '17 14:02





1 Answer

My answer applies to BigQuery Standard SQL. The RAND() function generates a pseudo-random value of type FLOAT64 in the range [0, 1), inclusive of 0 and exclusive of 1. You would use it for sampling much as you would use the FARM_FINGERPRINT function, except that you don't need to specify any existing key. RAND() is uniformly distributed, so every row has the same chance of being selected; if some columns are skewed, the same skew is expected in the sample. Example of sampling approximately 10% of the rows in the table:

SELECT * FROM Table WHERE RAND() < 0.1
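
For comparison, the hashing approach mentioned in the question could be sketched roughly as follows. This is only an illustration, not a query from the original answer: the table name `my_dataset.my_table` and the key column `user_id` are hypothetical placeholders. Unlike RAND(), it returns the same rows every time it runs, because the selection depends only on the key value.

```sql
-- Deterministic sample of roughly 10% of the keys (BigQuery Standard SQL).
-- Hypothetical names: `my_dataset.my_table` and `user_id` stand in for
-- your own table and key column.
SELECT *
FROM `my_dataset.my_table`
-- FARM_FINGERPRINT returns a signed INT64; keeping only hashes divisible
-- by 10 selects about one key in ten, whatever the sign of the hash.
WHERE MOD(FARM_FINGERPRINT(CAST(user_id AS STRING)), 10) = 0
```

Note that this samples keys rather than rows: all rows sharing a `user_id` are kept or dropped together, so with a skewed key the sampled fraction of rows can differ noticeably from 10%.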
Mosha Pasumansky answered Sep 21 '22 00:09
