
Using FARM_FINGERPRINT for Google BigQuery

I have tables I am importing into Google BigQuery. These are my staging tables, and I am using views to transform them into logical warehouse tables. My issue is surrogate keys.

Right now, ROW_NUMBER() is failing because my data is too large.

I have been recommended to use FARM_FINGERPRINT.

Is there a proper way to generate surrogate keys in BigQuery?

Thanks

UPDATE 1, for clarification:

Some of my source tables have 30 fields and millions of records, too big to display here. We are transforming those records into target tables and need to create surrogate keys to assist the BI tools later on.

To that end, our views (which will stand in for the target tables) contain the transformations, plus ROW_NUMBER() functions to create unique surrogate keys. We realized that an unpartitioned ROW_NUMBER() forces BigQuery to push all of the data onto one node, which crashes the query.

We are playing around with FARM_FINGERPRINT(), passing in a string of concatenated fields that represents a unique business key and getting back an INT64 value to use as the surrogate key. The belief is that this will provide an integer the BI team can use as needed.
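A minimal sketch of that approach (the table and column names here are hypothetical stand-ins, not our actual schema):

```sql
-- Hypothetical staging table and business-key columns; substitute your own.
-- FARM_FINGERPRINT hashes the concatenated business key into an INT64.
SELECT
  FARM_FINGERPRINT(CONCAT(
    customer_code, '|',                   -- delimiter guards against
    CAST(order_date AS STRING), '|',      -- ('ab','c') vs ('a','bc') colliding
    region
  )) AS surrogate_key,
  t.*
FROM `my_project.staging.orders` AS t
```

The hash is deterministic and needs no window function, so it scales, but it is a 64-bit hash, so collisions are theoretically possible on very large key sets.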

What I am asking is: is there a proper way to generate an integer-based surrogate key in Google BigQuery when you have such a large dataset?

Thanks.

asked Oct 26 '25 by arcee123

1 Answer

Can you find a different strategy to partition your data?

This fails with "resources exceeded" (as expected - note that I'm not trying to de-duplicate them, but to number each of their mentions):

SELECT author
  , ROW_NUMBER() OVER(ORDER BY created_utc)
FROM `fh-bigquery.reddit_comments.2017_06`

But I could partition my authors by the first letter, to have the data fit in a partition:

SELECT author
  , ROW_NUMBER() OVER(PARTITION BY REGEXP_EXTRACT(author, '.') ORDER BY created_utc)
FROM `fh-bigquery.reddit_comments.2017_06`

This works! Now each row can have an id composed of first_letter_author+row_number:

SELECT *
  , CONCAT(
      first_letter
      , '-'
      , CAST(ROW_NUMBER() 
      OVER(PARTITION BY REGEXP_EXTRACT(author, '.') 
      ORDER BY created_utc) AS STRING)) id
FROM (
  SELECT author, created_utc, REGEXP_EXTRACT(author, r'^.') first_letter
  FROM `fh-bigquery.reddit_comments.2017_06`
)
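If an integer key is still required downstream (as the question asks), one possible variation, offered as a sketch rather than a definitive recipe, is to hash that same composite id with FARM_FINGERPRINT, which keeps the partitioned ROW_NUMBER() and yields an INT64 at the cost of a small theoretical collision risk:

```sql
-- Hash the first_letter + row_number composite into an INT64 surrogate key.
SELECT *
  , FARM_FINGERPRINT(CONCAT(
      first_letter
      , '-'
      , CAST(ROW_NUMBER()
          OVER(PARTITION BY first_letter
               ORDER BY created_utc) AS STRING))) id
FROM (
  SELECT author, created_utc, REGEXP_EXTRACT(author, r'^.') first_letter
  FROM `fh-bigquery.reddit_comments.2017_06`
)
```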
answered Oct 29 '25 by Felipe Hoffa


