I am trying to find the best sampling practice in BigQuery. My dataset is quite big (11B rows), but the distribution tends to be skewed. So far I've been exploring these two options:

- TABLESAMPLE (block-level sampling)
- WHERE RAND() < K (row-level sampling)
Can anyone shed some more light on what is happening in the background with each of these?
Thanks a lot, Gallory
If you want to sample individual rows rather than data blocks, you can use a WHERE RAND() < K clause instead of the TABLESAMPLE clause. However, this approach requires BigQuery to scan the entire table, which increases your cost. To stay within budget while still benefiting from row-level sampling, you can combine both techniques, as shown below.
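A minimal sketch of the combined approach (the table name and percentages are placeholders): TABLESAMPLE first restricts the scan to a subset of blocks, and the RAND() filter then samples rows within that subset.

-- Scan only ~10% of the table's blocks, then keep ~1% of the scanned rows,
-- so BigQuery bills for roughly a tenth of the table while the final
-- sample is still drawn at the row level.
SELECT *
FROM dataset.my_table TABLESAMPLE SYSTEM (10 PERCENT)
WHERE RAND() < 0.01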
As per https://www.json.org/json-en.html, a valid JSON value can only be a string (in double quotes), a number, an object, an array, true, false, or null. NaN is not among these, so BigQuery considers it an invalid value and interprets it as null.
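As a quick illustration, here is a sketch assuming the GoogleSQL PARSE_JSON function is available in your project; the SAFE. prefix makes it return NULL instead of raising an error on invalid input.

-- The first literal is valid JSON; the second contains a bare NaN,
-- which is not a legal JSON value, so parsing fails and SAFE. yields NULL.
SELECT
  SAFE.PARSE_JSON('{"score": 1.5}') AS valid_json,
  SAFE.PARSE_JSON('{"score": NaN}') AS invalid_json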
My answer applies to BigQuery Standard SQL. The RAND() function generates a pseudo-random value of type FLOAT64 in the range [0, 1), inclusive of 0 and exclusive of 1. You would use it for sampling much like the FARM_FINGERPRINT function, except that you don't need to specify any existing key. RAND() draws from a uniform distribution, so if some columns are skewed, the same skew is expected in the sample. For example, to sample roughly 10% of the rows in a table:
SELECT * FROM Table WHERE RAND() < 0.1
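Since RAND() is non-deterministic, the sample changes on every run. If you need a repeatable sample, a common alternative is to hash a key with the FARM_FINGERPRINT function mentioned above. This is a sketch that assumes your table has a user_id column to serve as the key:

-- Repeatable ~10% sample: rows with the same user_id always hash to the
-- same bucket (0-9), so the sample is stable across runs.
SELECT *
FROM Table
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(user_id AS STRING))), 10) = 0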