I would like to split a table with 120 million rows into multiple equally sized tables in Google BigQuery. I found this answer to a question that asks about getting row numbers for a table. The example solution was:
SELECT
  ROW_NUMBER() OVER() AS row_number,
  contributor_username
FROM (
  SELECT contributor_username
  FROM [publicdata:samples.wikipedia]
  GROUP BY contributor_username
)
which gives you a row_number for each row; you could then populate each new table by manually selecting

WHERE row_number BETWEEN x AND y

This works for the table in the example, but ROW_NUMBER() yields a "resources exceeded" error on a table with 117MM rows: an analytic function with an empty OVER() clause cannot be parallelized, so every row has to pass through a single worker.
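One variant worth noting (a sketch only, not something the linked answer shows): giving the analytic function a PARTITION BY clause lets the work be distributed across workers. Here bucket is a hypothetical column derived from a hash of the key, and you would address rows by the pair (bucket, row_number) rather than by a single global row_number:

SELECT
  bucket,
  ROW_NUMBER() OVER(PARTITION BY bucket) AS row_number,
  contributor_username
FROM (
  SELECT
    contributor_username,
    ABS(HASH(contributor_username)) % 100 AS bucket
  FROM (
    SELECT contributor_username
    FROM [publicdata:samples.wikipedia]
    GROUP BY contributor_username
  )
)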
I then tried the following idea:
SELECT field1, field2
FROM (
  SELECT field1, field2, NTILE(100) OVER (ORDER BY row_id) AS ntile
  FROM (
    SELECT HASH(some_unique_key) AS row_id, field1, field2
    FROM table_with_177_MM_rows
  )
)
WHERE ntile = 1
I would then need to run this for ntile=[1,...,100]. This also failed. Is there any way to split a large table with a unique key in GBQ into smaller tables?
I tried the approach below on a table with 500 million rows as well as on a table with 3 billion rows, and it worked as expected.
First, you create a new table (temp_table) with an extra field, rnd:
SELECT
  field1, field2,
  RAND() AS rnd
FROM YourBigTable
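Side note: this answer predates standard SQL DDL in BigQuery. A minimal sketch of the same step in today's standard SQL, assuming the tables live in a hypothetical dataset mydataset, which materializes the temp table in one statement with no destination table to configure:

CREATE TABLE mydataset.temp_table AS
SELECT
  field1, field2,
  RAND() AS rnd
FROM mydataset.YourBigTable;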
Then you run something like the query below, as many times as needed, depending on how many tables you want to split into:
SELECT
  field1, field2
FROM temp_table
WHERE rnd >= 0.3 AND rnd < 0.4
Of course, you should make sure you set the destination table for each run accordingly. Note that the splits will be only approximately equal in size, since RAND() distributes rows uniformly but not exactly evenly.
Delete temp_table after all is done.
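Another side note: with BigQuery scripting (which also postdates this answer) the repeated runs can be driven from a single script instead of setting destination tables by hand. A sketch only, assuming ten splits and the same hypothetical mydataset:

DECLARE i INT64 DEFAULT 0;
WHILE i < 10 DO
  -- each iteration materializes one slice, e.g. mydataset.split_03 for rnd in [0.3, 0.4)
  EXECUTE IMMEDIATE FORMAT("""
    CREATE TABLE mydataset.split_%02d AS
    SELECT field1, field2
    FROM mydataset.temp_table
    WHERE rnd >= %f AND rnd < %f""", i, i / 10, (i + 1) / 10);
  SET i = i + 1;
END WHILE;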
Hard to say if it is a better way or not, though. Up to you!
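One more thought on the original ask: if you want the split to be deterministic and driven by the unique key rather than by RAND(), you can bucket on a hash of the key and skip the temp table entirely. A sketch in the question's legacy SQL, assuming some_unique_key hashes reasonably evenly; run it once per bucket value 0..9, each with its own destination table:

SELECT field1, field2
FROM table_with_177_MM_rows
WHERE ABS(HASH(some_unique_key)) % 10 = 0  -- bucket 0 of 10; vary 0..9 across runs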