We have a 1.01 TB table with known duplicates that we are trying to de-duplicate using GROUP EACH BY.
There is an error message we'd like some help deciphering:
Query Failed
Error: Shuffle failed with error: Cannot shuffle more than 3.00T in a single shuffle. One of the shuffle partitions in this query exceeded 3.84G. Strategies for working around this error are available on the go/dremelfaq.
Job ID: job_MG3RVUCKSDCEGRSCSGA3Z3FASWTSHQ7I
The query, as you'd imagine, does quite a bit and looks a little something like this:
SELECT Twenty, Different, Columns, For, Deduping, ...
including_some, INTEGER(conversions),
plus_also, DAYOFWEEK(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
and_also, HOUROFDAY(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
and_a, IF(REGEXP_MATCH(long_string_field,r'ab=(\d+)'),TRUE, NULL) as flag_for_counting,
with_some, joined, reference, columns,
COUNT(*) as duplicate_count
FROM [MainDataset.ReallyBigTable] as raw
LEFT OUTER JOIN [RefDataSet.ReferenceTable] as ref
ON ref.id = raw.refid
GROUP EACH BY ... all columns in the select bar the count...
What does this error mean? Is it trying to do this kind of shuffling? ;-) And finally, is the dremelfaq referenced in the error message available outside of Google, and would it help us understand what's going on?
For completeness, we tried a more modest GROUP EACH:
SELECT our, twenty, nine, string, column, table,
COUNT(*) as dupe_count
FROM [MainDataSet.ReallyBigTable]
GROUP EACH BY all, those, twenty, nine, string, columns
And we received a more subtle error:
Error: Resources exceeded during query execution.
Job ID: job_D6VZEHB4BWZXNMXMMPWUCVJ7CKLKZNK4
Should BigQuery be able to perform these kinds of de-duplication queries? How should we best approach this problem?
Actually, the shuffling involved is closer to this: http://www.youtube.com/watch?v=KQ6zr6kCPj8.
When you use the 'EACH' keyword, you're instructing the query engine to shuffle your data by the grouping key; you can think of it as a giant, distributed sort operation.
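For illustration, a minimal sketch (assuming a hypothetical table [MyDataset.Events] with a string column user_id). A plain GROUP BY is aggregated without a shuffle and works well when the number of distinct groups is small:

SELECT user_id, COUNT(*) AS event_count
FROM [MyDataset.Events]
GROUP BY user_id

Adding EACH tells the engine to first shuffle rows across workers by user_id, much like a distributed sort, and then aggregate each partition separately:

SELECT user_id, COUNT(*) AS event_count
FROM [MyDataset.Events]
GROUP EACH BY user_id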
This is likely pushing close to the cluster limits that we've set in BigQuery. I'll talk to some of the other folks on the BigQuery team to see if we can figure out a way to make your query work.
In the meantime, one option would be to partition your data into smaller tables and do the deduping on those smaller tables, then use table copy/append operations to create your final output table. To partition your data, you can do something like:
(SELECT * FROM [your_big_table] WHERE ABS(HASH(column1) % 10) == 1)
Unfortunately, this is going to be expensive, since it will require running the query over your 1 TB table 10 times.
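If it helps, here is a rough sketch of that workflow. It simply reuses the shape of your original query, so the column list and GROUP BY placeholders are carried over; column1 is a placeholder for whichever column you hash on, and each shard's result would be written to its own hypothetical destination table (e.g. TempDataset.Deduped_1):

SELECT Twenty, Different, Columns, For, Deduping, ...,
  COUNT(*) as duplicate_count
FROM [MainDataset.ReallyBigTable] as raw
LEFT OUTER JOIN [RefDataSet.ReferenceTable] as ref
ON ref.id = raw.refid
WHERE ABS(HASH(raw.column1) % 10) == 1
GROUP EACH BY ... all columns in the select bar the count ...

Run the same query once for each remainder 0 through 9, writing each result to its own destination table, then append the ten de-duplicated results into your final table with table copy jobs.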