I know that work is being done to improve the Join feature on Bigquery, not to rant here but it will be hard to analyze 'Terabyte' sets of data as 'advertised' if Joins can not be used properly.
OK, back to the problem, I have two tables one is 600 Megs and the other one is 50Megs, I did try to make a join and I got an error about smaller table must be left. I did some research and I found out that Bigquery
considers both tables as big if they are greater than 7MB?
So based on some advice I found online I did select the dataset in question of the smaller table and saved it in a new table, the new data set is 12MB and 600K rows over two columns. Then I tried my query again but I am still getting the same error :
Query Failed : Error: Large table cdrs_test.geoIP_Left must appear as the leftmost table in a join query
geoIP_Left is 12 Megs, the other table is 600Megs..
Can this be fixed somehow or am I busted? If so anyone knows about any other service that I can use to analyse big sets of data that does support joins?
EDIT: This is the actual query;
SELECT COUNT(results.cc_card) AS count,
sum(results.sessiontime) AS time,
geoIP_Left.place AS place
FROM cdrs_test.cdrs_2010_5 AS results
JOIN cdrs_test.geoIP_Left AS geoIP_table
ON results.cc_card = geoIP_table.vcard
WHERE results.sessiontime > 0 AND results.countryName Contains 'India'
GROUP BY place;
You could express this query as a subselect that references only the columns from cdrs_test.geoIP_Left
that you are interested in. See the second example here.
SELECT
COUNT(results.cc_card) AS count,
sum(results.sessiontime) AS time,
geoIP_table.place AS place
FROM
cdrs_test.cdrs_2010_5 AS results
JOIN
(SELECT place, vcard FROM cdrs_test.geoIP_Left)
AS
geoIP_table
ON
results.cc_card = geoIP_table.vcard
WHERE
results.sessiontime > 0 AND results.countryName CONTAINS 'India'
GROUP BY
place;
You can also simply run multiple queries - note that you can save query results explicitly as a named table, and use that table in a later query.
Finally, another option would be to pre-join the data, using (for example) a MapReduce transformation pipeline, or an ETL tool provided by one of the BigQuery ETL partners.
As mentioned by Felipe in the comment. BigQuery has allowed a modifier "EACH" to JOIN to allow JOINs of 2 big tables. From the Query Reference page -
Normal JOIN operations require that the right-side table contains less than 8 MB of compressed data. The EACH modifier is a hint that informs the query execution engine that the JOIN might reference two large tables. The EACH modifier can't be used in CROSS JOIN clauses.
When possible, use JOIN without the EACH modifier for best performance. Use JOIN EACH when table sizes are too large for JOIN.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With