I have a table with ~1 million rows in Google BigQuery from the NYC Yellow TaxiCab public dataset. As you can see from that link, the schema does not have a primary key. Each row represents a trip/transaction, but there is no customer_id
field.
I want to add a column customer_id
and distribute random numbers to it such that:
For rows 1-20, `customer_id` should be assigned `1`
For rows 21-40, `customer_id` should be assigned `2`
and so on..
In other words, I want exactly (and any) 20 rows in the table to have a specific value of customer_id
.
Assign each row a random id, getting groups of ~20 rows for each new_id
:
#standardSQL
SELECT CAST(FLOOR(COUNT(*) OVER()/20*RAND()) AS INT64) new_id, *
FROM (
SELECT login
FROM `ghtorrent-bq.ght_2017_04_01.users`
LIMIT 1000000
)
Proof that 50,000 "customers_ids" are generated when going over a million rows:
Below is for BigQuery Standard SQL and produces exactly 20 entries per customer_id
#standardSQL
SELECT DIV(ROW_NUMBER() OVER() - 1, 20) AS customer_id, *
FROM `yourTable`
-- ORDER BY customer_id
You can test play with this using dummy data as below
#standardSQL
WITH `yourTable` AS (
SELECT login
FROM `ghtorrent-bq.ght_2017_04_01.users`
LIMIT 1000000
)
SELECT DIV(ROW_NUMBER() OVER() - 1, 20) AS customer_id, *
FROM `yourTable`
-- ORDER BY customer_id
Moreover - below query shows the distribution of counts per customer_id
#standardSQL
WITH `yourTable` AS (
SELECT login
FROM `ghtorrent-bq.ght_2017_04_01.users`
LIMIT 1000000
)
SELECT cnt, COUNT(1) AS distribution FROM (
SELECT customer_id, COUNT(1) AS cnt FROM (
SELECT *, DIV(ROW_NUMBER() OVER() - 1, 20) AS customer_id
FROM `yourTable`
ORDER BY customer_id
)
GROUP BY customer_id
)
GROUP BY cnt
ORDER BY cnt
with output as below
Row cnt distribution
--- --- ------------
1 20 50000
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With