Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Joins on Google Bigquery

I know that work is being done to improve the Join feature on Bigquery, not to rant here but it will be hard to analyze 'Terabyte' sets of data as 'advertised' if Joins can not be used properly.

OK, back to the problem, I have two tables one is 600 Megs and the other one is 50Megs, I did try to make a join and I got an error about smaller table must be left. I did some research and I found out that Bigquery considers both tables as big if they are greater than 7MB?

So based on some advice I found online I did select the dataset in question of the smaller table and saved it in a new table, the new data set is 12MB and 600K rows over two columns. Then I tried my query again but I am still getting the same error :

Query Failed : Error: Large table cdrs_test.geoIP_Left must appear as the leftmost table in a join query

geoIP_Left is 12 Megs, the other table is 600Megs..

Can this be fixed somehow or am I busted? If so anyone knows about any other service that I can use to analyse big sets of data that does support joins?

EDIT: This is the actual query;

SELECT COUNT(results.cc_card) AS count,
       sum(results.sessiontime) AS time, 
       geoIP_Left.place AS place 
FROM cdrs_test.cdrs_2010_5 AS results 
JOIN cdrs_test.geoIP_Left AS geoIP_table 
  ON results.cc_card = geoIP_table.vcard 
WHERE results.sessiontime > 0 AND results.countryName Contains 'India' 
GROUP BY place;
like image 969
user1838066 Avatar asked Nov 20 '12 08:11

user1838066


2 Answers

You could express this query as a subselect that references only the columns from cdrs_test.geoIP_Left that you are interested in. See the second example here.

SELECT
  COUNT(results.cc_card) AS count,
  sum(results.sessiontime) AS time, 
  geoIP_table.place AS place 
FROM
  cdrs_test.cdrs_2010_5 AS results 
JOIN
  (SELECT place, vcard FROM cdrs_test.geoIP_Left)
AS
  geoIP_table 
ON
  results.cc_card = geoIP_table.vcard 
WHERE
  results.sessiontime > 0 AND results.countryName CONTAINS 'India' 
GROUP BY
  place;

You can also simply run multiple queries - note that you can save query results explicitly as a named table, and use that table in a later query.

Finally, another option would be to pre-join the data, using (for example) a MapReduce transformation pipeline, or an ETL tool provided by one of the BigQuery ETL partners.

like image 143
Michael Manoochehri Avatar answered Sep 17 '22 19:09

Michael Manoochehri


As mentioned by Felipe in the comment. BigQuery has allowed a modifier "EACH" to JOIN to allow JOINs of 2 big tables. From the Query Reference page -

Normal JOIN operations require that the right-side table contains less than 8 MB of compressed data. The EACH modifier is a hint that informs the query execution engine that the JOIN might reference two large tables. The EACH modifier can't be used in CROSS JOIN clauses.

When possible, use JOIN without the EACH modifier for best performance. Use JOIN EACH when table sizes are too large for JOIN.

like image 28
Vaibhav Avatar answered Sep 18 '22 19:09

Vaibhav