You built a fact table as `materialized: table`. Every night it rebuilds the full history, takes twenty minutes, and costs more than it should. This guide walks through converting it to an incremental model on Databricks, with the decisions you have to make at each step.
Before you start
Incremental models are a step up in complexity. Four things must be true before it is worth the cost:
- The table is large. Rule of thumb: a full refresh takes longer than ten minutes, or the underlying source has more than a hundred million rows.
- The source has a reliable watermark column (a monotonic timestamp like `updated_at` or `loaded_at`).
- You can answer "what is a unique row?" confidently. That answer becomes your `unique_key`.
- You are willing to run `--full-refresh` occasionally. Incremental models are not a silver bullet; they accumulate drift.
Note
If any of those four are not true, stick with `table` for now. A correct full-refresh model beats a subtly broken incremental one every time.
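Before committing, it is worth checking the watermark and key assumptions directly. A hedged pre-flight query (the `stg_orders`, `order_id`, and `updated_at` names are assumptions; substitute your own staging source):

```sql
-- Pre-flight sanity check on the staging source:
-- a reliable watermark has no nulls, and the candidate
-- unique key should show zero duplicates.
select
    count_if(updated_at is null)        as null_watermarks,
    count(*) - count(distinct order_id) as duplicate_keys,
    max(updated_at)                     as latest_watermark
from stg_orders
```

If `null_watermarks` or `duplicate_keys` comes back nonzero, resolve that in staging first; it will surface as missing rows or merge errors later.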
1. Pick the incremental strategy
dbt-databricks supports several strategies. The four below cover 95% of cases.
| Strategy | Use when | Cost |
|---|---|---|
| `merge` | The source has updates to existing rows (SCD-1, upserts). | Highest write cost. |
| `append` | Events are immutable once written. Pure inserts. | Cheapest. |
| `insert_overwrite` | You reprocess whole partitions, often "today". | Low for the touched partition. |
| `replace_where` | You rewrite a predicate-scoped slice like "last 3 days". | Low; flexible. |
Start with `merge` if you have updates, `append` if you do not. You can always switch later.
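For contrast with the `merge` model built in the next section, here is what the `append` case reduces to. A sketch, with assumed names (`fct_events`, `stg_events`, `loaded_at`); note there is no `unique_key` because immutable events never need updating in place:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='append'   -- pure inserts; no unique_key needed
) }}

select * from {{ ref('stg_events') }}
{% if is_incremental() %}
-- strictly greater-than: an already-loaded event is never re-inserted
where loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}
```

The strict `>` is safe here only because events are immutable; the `merge` model below deliberately uses `>=` plus deduplication instead.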
2. Configure the model
Here is the annotated shape. Your fact table probably has different column names.
```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    on_schema_change='append_new_columns',
    tblproperties={'delta.autoOptimize.optimizeWrite': 'true'}
) }}

with source as (
    select * from {{ ref('stg_orders') }}
),

ranked as (
    select *,
        row_number() over (
            partition by order_id
            order by updated_at desc
        ) as rn
    from source
    {% if is_incremental() %}
    where updated_at >= (
        select coalesce(max(updated_at), '1900-01-01')
        from {{ this }}
    )
    {% endif %}
),

final as (
    select * except(rn) from ranked where rn = 1
)

select * from final
```
Four pieces do real work:
- `unique_key` — what identifies a row uniquely. For an order it is `order_id`; for an event table, typically `event_id`.
- The `is_incremental()` block — filters the source to rows the target does not have yet. On the first run (the table does not exist), this block is skipped and everything loads.
- `on_schema_change: append_new_columns` — if upstream adds a new column, add it to the target rather than failing. Additive-only; type changes still require `--full-refresh`.
- Deduplication via `row_number` — protects against duplicate-arriving rows from CDC. Without it, `merge` raises `MERGE_CARDINALITY_VIOLATION`.
Warning
`MERGE_CARDINALITY_VIOLATION` at runtime means your source had more than one row per `unique_key` in the batch being merged. Always deduplicate in the model, even if you believe the source is clean: CDC pipelines re-send rows, and late-arriving rows get repeated.
3. Scope is_incremental() aggressively
The single most important line is the where clause inside the `is_incremental()` block. If it is loose, the model reads the entire source on every run, and you have bought incremental complexity for no speedup.
Danger
`where updated_at > now() - interval 7 days` looks reasonable and is silently wrong. If a prod run is delayed by eight days, you lose rows forever. Always anchor the predicate to a value already in the target: `max(updated_at)` from `{{ this }}`. That way missed runs automatically backfill on the next invocation.
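One refinement on top of the `{{ this }}` anchor: if the source can deliver rows out of order (common with CDC), a fixed lookback buffer re-reads a small recent window, and the `row_number` dedup in the model absorbs the overlap. A sketch; the two-day buffer is an assumption to tune against your pipeline's observed lateness:

```sql
{% if is_incremental() %}
where updated_at >= (
    -- anchor to the target, then step back a fixed buffer to catch
    -- out-of-order arrivals; the row_number dedup handles the overlap
    select coalesce(max(updated_at), '1900-01-01') - interval 2 days
    from {{ this }}
)
{% endif %}
```

The buffer trades a bounded amount of re-reading for resilience; unlike a wall-clock predicate, it is still anchored to `{{ this }}`, so missed runs backfill correctly.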
4. First run
With an empty target, dbt builds the whole table. This takes as long as a `table` full refresh did. Subsequent runs only touch new rows.
```shell
dbt build --select fct_orders
```
Verify the model ran as incremental on the second run:
```shell
dbt run --select fct_orders
# Look for: "1 of 1 OK incremental model dev.marts.fct_orders"
```
5. When to --full-refresh
Incremental models accumulate drift. Rebuild them from scratch when:
- You change the column list, change a column type, or change business logic.
- Upstream source data was corrected for historical dates.
- `on_schema_change: fail` triggered; you understood the change, and now you want to apply it.
```shell
dbt run --select fct_orders --full-refresh
```
Warning
Full refresh rewrites the whole table. On a ten-billion-row fact table that is an expensive operation. Run during off-peak if possible, or use `replace_where` with a targeted predicate so you only rewrite the affected slice.
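A sketch of that targeted-slice approach, assuming an `order_date` partition column and a corrected window; check the dbt-databricks adapter docs for the exact `incremental_predicates` shape your version expects before relying on it:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='replace_where',
    -- only rows matching this predicate are deleted and rewritten;
    -- the model's select must produce the full replacement slice
    incremental_predicates=["order_date >= '2024-01-01'"]
) }}

select * from {{ ref('stg_orders') }}
where order_date >= '2024-01-01'
```

The predicate and the model's own filter must agree, otherwise you delete a slice you do not fully rewrite.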
6. Add tests
An incremental model needs a `unique` test on the unique key and a `not_null` test on every column declared not-null. Declare them in the model's `.yml` file:
```yaml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: updated_at
        tests:
          - not_null
```
`dbt build` runs these automatically after the model builds. A test failure halts downstream models, which is almost always what you want.
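If your `unique_key` is composite (say an order-line grain keyed by `order_id` plus `line_number`), a plain `unique` test on one column will fail by design. The dbt-utils package provides a combination test; a sketch, assuming dbt-utils is installed and using hypothetical names:

```yaml
version: 2

models:
  - name: fct_order_lines
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - line_number
```

The same composite key should then appear as a list in the model's `unique_key` config, so the test guards exactly what the merge keys on.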
Common mistakes
| Symptom | Root cause |
|---|---|
| Incremental is slower than `table` | `is_incremental()` filter is too loose; the model reads the full source anyway. |
| `MERGE_CARDINALITY_VIOLATION` | Source has duplicates for the `unique_key`. Deduplicate before the merge. |
| Missing rows after a missed run | `is_incremental()` filter was anchored to wall-clock time instead of `max(updated_at)` from `{{ this }}`. |
| Schema drift error | Upstream added a column; `on_schema_change` is `fail` or `ignore`. |
| `DELTA_MISSING_COLUMN` | Upstream removed a column the target still expects. Needs `--full-refresh` with the updated model. |
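When `MERGE_CARDINALITY_VIOLATION` fires, an ad-hoc query against the staging source pinpoints the offending keys. A sketch with assumed names (`stg_orders`, `order_id`):

```sql
-- List the keys that appear more than once in the batch;
-- these are the rows the merge could not match unambiguously.
select order_id, count(*) as copies
from stg_orders
group by order_id
having count(*) > 1
order by copies desc
limit 20
```

Inspecting a few of these rows usually reveals whether the cause is CDC re-delivery (fix with the model's dedup) or a genuinely wrong `unique_key` choice.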
See also
- Materializations reference — exhaustive options for each strategy.
- Model authoring standards — Causeway's rules for incremental models.
- Failure triage — what to do when an incremental run fails in prod.