dbt snapshots with non-unique records in the source

Question 1

Has anyone here dealt with dbt snapshots where the source is not always unique on the snapshot key?
I have a data lake where data arrives on an append-only basis. Every time the source is updated, a new record is created in the corresponding table in the data lake.
By the time the dbt job runs, my source can contain more than one row with the same unique id, because the data has changed more than once since the last run.
Ideally, I’d like to close the current row in the snapshot table by setting its dbt_valid_to to the updated_at of the earliest new source record, and then add the new records to the snapshot table, with the record that has the latest updated_at becoming the current one. I know how to achieve this using window functions, but I’m not sure how to handle this situation with dbt. Has anybody faced this issue before?

Snapshot Table

| **id** | **some_attribute** | **valid_from**      | **valid_to**        |
|--------|--------------------|---------------------|---------------------|
| 123    | ABCD               | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123    | ZABC               | 2021-06-30 00:00:00 | null                |

Source Table

| **id** | **some_attribute** | **updated_at**      | **status**                        |
|--------|--------------------|---------------------|-----------------------------------|
| 123    | ABCD               | 2021-01-01 00:00:00 | already been loaded to snapshot   |
| 123    | ZABC               | 2021-06-30 00:00:00 | already been loaded to snapshot   |
| 123    | ZZAB               | 2021-11-21 00:10:00 | new since the last run            |
| 123    | FXAB               | 2021-11-21 15:11:00 | new since the last run            |

Snapshot Desired Result

| **id** | **some_attribute** | **valid_from**      | **valid_to**        |
|--------|--------------------|---------------------|---------------------|
| 123    | ABCD               | 2021-01-01 00:00:00 | 2021-06-30 00:00:00 |
| 123    | ZABC               | 2021-06-30 00:00:00 | 2021-11-21 00:10:00 |
| 123    | ZZAB               | 2021-11-21 00:10:00 | 2021-11-21 15:11:00 |
| 123    | FXAB               | 2021-11-21 15:11:00 | null                |

Jeremy Yeo · Answer 1 · 2021-11-30T22:19:55

Standard snapshots operate under the assumption that the source table being snapshotted is changed in place, without storing history. That is the opposite of the behaviour here, where the source table is essentially an append-only log of events. This means we may get away with simply using a boring old incremental model to achieve the same SCD2 outcome that snapshots give us.

I have some sample code where I did just that, which may be of some help: https://gist.github.com/jeremyyeo/3a23f3fbcb72f10a17fc4d31b8a47854
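
As a rough illustration of the idea (this is not the code from the linked gist), the SCD2 validity ranges can be derived straight from an append-only log with a `lead()` window function. The sketch below runs the query against an in-memory SQLite database so it can be tried locally; the table name `source_table` and the helper `build_scd2` are hypothetical, and in a dbt model the `from` clause would presumably point at a `source()` or `ref()` instead.

```python
import sqlite3

# Hypothetical append-only source rows, mirroring the example in the question.
SOURCE_ROWS = [
    (123, "ABCD", "2021-01-01 00:00:00"),
    (123, "ZABC", "2021-06-30 00:00:00"),
    (123, "ZZAB", "2021-11-21 00:10:00"),
    (123, "FXAB", "2021-11-21 15:11:00"),
]

# Each row is valid from its own updated_at until the next row's updated_at
# for the same id. The most recent row gets valid_to = null.
SCD2_SQL = """
select
    id,
    some_attribute,
    updated_at as valid_from,
    lead(updated_at) over (partition by id order by updated_at) as valid_to
from source_table
order by id, valid_from
"""


def build_scd2(rows):
    """Load rows into an in-memory SQLite table and derive SCD2 ranges."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table source_table "
            "(id integer, some_attribute text, updated_at text)"
        )
        conn.executemany("insert into source_table values (?, ?, ?)", rows)
        return conn.execute(SCD2_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in build_scd2(SOURCE_ROWS):
        print(row)
```

With the four example rows this reproduces the desired result table from the question. In an incremental model, the previously current row also has to be rewritten when a newer record arrives (its `valid_to` changes from null to a timestamp), so the model would likely need a merge key such as the combination of `id` and `valid_from`; that part is an assumption and is not shown here.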

Anders Swanson · Answer 2 · 2021-11-24T01:37:58

I agree it would be very convenient if dbt snapshots had a strategy that could involve deduplication, but it isn’t supported today.

The easiest workaround would be a staging view downstream of the source that applies the window function you describe. Then you snapshot that view.
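
A minimal sketch of what such a deduplicating view could select, assuming the goal is to keep only the latest record per `id`. The query is run through SQLite here so it can be tested locally; `source_table`, `DEDUP_SQL` and `latest_per_id` are hypothetical names, and the second id (456) is made-up sample data.

```python
import sqlite3

# Keep only the most recent row per id, so the view is unique on id.
DEDUP_SQL = """
select id, some_attribute, updated_at
from (
    select
        id,
        some_attribute,
        updated_at,
        row_number() over (
            partition by id order by updated_at desc
        ) as row_num
    from source_table
) as ranked
where row_num = 1
order by id
"""


def latest_per_id(rows):
    """Return one row per id: the one with the greatest updated_at."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table source_table "
            "(id integer, some_attribute text, updated_at text)"
        )
        conn.executemany("insert into source_table values (?, ?, ?)", rows)
        return conn.execute(DEDUP_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        (123, "ZABC", "2021-06-30 00:00:00"),
        (123, "ZZAB", "2021-11-21 00:10:00"),
        (123, "FXAB", "2021-11-21 15:11:00"),
        (456, "QRST", "2021-11-20 09:00:00"),  # made-up second id
    ]
    for row in latest_per_id(sample):
        print(row)
```

The snapshot would then be configured against this view, for example with `unique_key='id'`, `strategy='timestamp'` and `updated_at='updated_at'`. Note the trade-off: because only the latest record per id survives the view, intermediate versions that arrive between two runs (ZZAB in the question's example) never reach the snapshot.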

However, I do see potential for a new snapshot strategy that handles append only sources. Perhaps you’d like to peruse the dbt Snapshot docs and strategies source code on existing strategies to see if you’d like to make a new one!
