 

"Shuffle failed with error" message when trying to use GROUP BY to return a tables disinct rows

We have a 1.01TB table with known duplicates that we are trying to de-duplicate using GROUP EACH BY.

There is an error message we'd like some help deciphering:

Query Failed

Error: Shuffle failed with error: Cannot shuffle more than 3.00T in a single shuffle. One of the shuffle partitions in this query exceeded 3.84G. Strategies for working around this error are available on the go/dremelfaq.

Job ID: job_MG3RVUCKSDCEGRSCSGA3Z3FASWTSHQ7I

The query, as you'd imagine, does quite a bit and looks a little something like this:

SELECT Twenty, Different, Columns, For, Deduping, ... 
       including_some, INTEGER(conversions),
       plus_also, DAYOFWEEK(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
       and_also, HOUROFDAY(SEC_TO_TIMESTAMP(INTEGER(timestamp_as_string))), conversions,
       and_a, IF(REGEXP_MATCH(long_string_field,r'ab=(\d+)'),TRUE, NULL) as flag_for_counting,
       with_some, joined, reference, columns,
       COUNT(*) as duplicate_count
FROM [MainDataset.ReallyBigTable] as raw
LEFT OUTER JOIN [RefDataSet.ReferenceTable] as ref
ON ref.id = raw.refid
GROUP EACH BY ... all columns in the select bar the count...

Question

What does this error mean? Is it trying to do this kind of shuffling? ;-) And finally, is the dremelfaq referenced in the error message available outside of Google, and would it help us understand what's going on?

Side Note

For completeness, we tried a more modest GROUP EACH BY:

 SELECT our, twenty, nine, string, column, table,
        count(*) as dupe_count
 FROM [MainDataSet.ReallyBigTable] 
 GROUP EACH BY all, those, twenty, nine, string, columns

And we receive a more subtle error:

Error: Resources exceeded during query execution.

Job ID: job_D6VZEHB4BWZXNMXMMPWUCVJ7CKLKZNK4

Should BigQuery be able to perform these kinds of de-duplication queries? How should we best approach this problem?

asked Nov 21 '13 by neversleepz


1 Answer

Actually, the shuffling involved is closer to this: http://www.youtube.com/watch?v=KQ6zr6kCPj8.

When you use the 'EACH' keyword, you're instructing the query engine to shuffle your data... you can think of it as a giant sort operation.

This is likely pushing close to the cluster limits that we've set in BigQuery. I'll talk to some of the other folks on the BigQuery team to see if there is a way we can figure out how to make your query work.

In the meantime, one option would be to partition your data into smaller tables and do the deduping on those smaller tables, then use table copy/append operations to create your final output table. To partition your data, you can do something like:

(SELECT * FROM [your_big_table] WHERE ABS(HASH(column1) % 10) = 1)

Unfortunately, this is going to be expensive, since it will require running the query over your 1 TB table 10 times.
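For concreteness, a minimal sketch of one bucket's pass might look like the query below. It assumes ten hash buckets keyed on a placeholder column (bucket_key stands in for whichever stable column you hash on, and the SELECT list is abbreviated to the question's placeholder names); each pass would write to its own scratch table with large results allowed:

SELECT Twenty, Different, Columns, For, Deduping,
       COUNT(*) AS duplicate_count
FROM [MainDataset.ReallyBigTable]
WHERE ABS(HASH(bucket_key) % 10) = 0   -- repeat for buckets 1 through 9
GROUP EACH BY Twenty, Different, Columns, For, Deduping

Once the ten scratch tables exist, they can be appended into a single output table with table copy jobs (for example, via the bq cp command with its append option); the copy jobs don't re-scan the data, so the cost comes from the ten per-bucket queries over the full table, as noted above.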

answered Nov 15 '22 by Jordan Tigani