Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dbt_utils.surrogate_key() with all fields except two

Tags:

jinja2

dbt

I am currently playing with dbt and trying to build a PSA (Persistent Staging Area) and the snapshot functionality lends itself well to this in my eyes. However, I don't have timestamps in the sources, so I have to use the "check" strategy.

For "check_cols" I would like to use a hash value, so I thought of dbt_utils.surrogate_key().

But I would like to calculate the hash value over all columns except the two columns that are always the same.

So my model looks like this:

{% snapshot Item_hist %}
{{
    config(
      unique_key='item_id',
      strategy='check',
      check_cols=['diff_hash'],
      target_database='PSA',
      target_schema='sourceA',
      alias= 'Item',
      invalidate_hard_deletes=True
    )
}}

select {{ dbt_utils.surrogate_key(['Tenant','ItemNo']) }} as item_id,
{{ dbt_utils.surrogate_key( dbt_utils.star(from=source('sourceA', 'item'), except=["fieldA", "fieldB"]) ) }} as diff_hash,
* 
from {{ source('sourceA', 'item') }}
{% endsnapshot %}

Unfortunately, the dbt_utils.surrogate_key() cannot handle the return value of dbt_utils.star(). How could I proceed here so that surrogate_key() can calculate a hash value from the return?

like image 772
joboro Avatar asked Sep 06 '25 03:09

joboro


1 Answers

Checking the source code, dbt_utils.star, returns the columns in the format

  `col1` as `col1`,
  `col2` as `col2`,
  ...

The quotes depend on the adapter (BigQuery in the example) and the alias using as is included because the method can get suffix and prefix parameters

dbt_utils.surrogate_key is expecting a list of values, instead of the string returned by dbt_utils.star

I have found a solution in dbt Slack group using get_columns_in_relation

Thanks to Lee Werner who posted the solution:

{%- set columns = get_columns_in_relation(ref('my_model')) -%}

{% set column_names = columns|map(attribute='name')|list %}

SELECT
{{ dbt_utils.surrogate_key( column_names )}} as HashID,
*
FROM {{ref('my_model')}}

In your case, you want to ignore some columns, so we can apply a reject filter to the list

{% snapshot Item_hist %}
{{
    config(
      unique_key='item_id',
      strategy='check',
      check_cols=['diff_hash'],
      target_database='PSA',
      target_schema='sourceA',
      alias= 'Item',
      invalidate_hard_deletes=True
    )
}}

{% set columns = get_columns_in_relation(source('sourceA', 'item')) -%}
{% set column_names = columns|map(attribute='name')|reject("in", ['fieldA','fieldB'])|list %}

select {{ dbt_utils.surrogate_key(['Tenant','ItemNo']) }} as item_id,
{{ dbt_utils.surrogate_key(column_names) }} as diff_hash,
* 
from {{ source('sourceA', 'item') }}
{% endsnapshot %}
like image 153
Diego Rodríguez Avatar answered Sep 07 '25 22:09

Diego Rodríguez