In this blog a Google Cloud employee explains how to do repeatable sampling of data sets for machine learning in BigQuery. This is very important for creating (and replicating) train/validation/test partitions of your data.
However the blog uses Legacy SQL, which Google has now deprecated in favor of Standard SQL.
How would you re-write the blog's sampling code shown below, but using Standard SQL?
#legacySQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
[bigquery-samples:airline_ontime_data.flights]
WHERE
ABS(HASH(date)) % 10 < 8
Standard SQL would re-write the query thus:
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8
Specifically here are the changes:
MOD
function (not %
).FARM_FINGERPRINT
(not HASH
). This is actually a different hashing function than Legacy SQL's HASH
, which wasn't in fact consistent over time as the blog had implied. Based on accepted answer, provide a more general way to generate unique key for each row:
TO_JSON_STRING(STRUCT(col1, col2, ..., colN))
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(STRUCT(date, airline, arrival_delay))), 10)) < 8
What if there is no unique key to identify each row?
Yes, it can happen that there are by design duplicated data rows in your dataset, with above query, either all of or none of the duplicates are included in sample set.
Depending on how big your dataset is, you can try to order the source dataset and using window function to generate a row_number for each row. Then do sampling based on row_number. This trick will work until you hit error sorting your dataset:
Resources exceeded during query execution: The query could not be executed in the allotted memory.
What if I do hit above error
Well, above way is simpler to implement but if you hit the limit, consider doing something more complex:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With