You built a fact table as materialized: table. Every night it rebuilds the full history, takes twenty minutes, and costs more than it should. This guide walks through converting it to an incremental model on Databricks, with the decisions you have to make at each step.

Before you start

Incremental models are a step up in complexity. Four things must be true before it is worth the cost:

  1. The table is large enough that a nightly full rebuild is genuinely slow or expensive.
  2. Each row has a stable unique key (order_id here).
  3. Each row carries a trustworthy high-watermark column such as updated_at.
  4. You can tolerate an occasional --full-refresh to correct drift.

Note

If any of those four are not true, stick with table for now. A correct full-refresh model beats a subtly broken incremental one every time.

1. Pick the incremental strategy

dbt-databricks supports several incremental strategies. The four below cover 95% of cases.

| Strategy | Use when | Cost |
| --- | --- | --- |
| merge | The source has updates to existing rows (SCD-1, upserts). | Highest writes. |
| append | Events are immutable once written. Pure inserts. | Cheapest. |
| insert_overwrite | You reprocess whole partitions, often "today". | Low for the touched partition. |
| replace_where | Rewrite a predicate-scoped slice like "last 3 days". | Low; flexible. |

Start with merge if you have updates, append if you do not. You can always switch later.
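For the append case, the config is simpler because there is no key to match on. A minimal sketch, assuming an immutable events staging model named stg_events with an event_ts column (both names are illustrative):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='append'
) }}

-- Pure inserts: no unique_key, no merge scan.
select *
from {{ ref('stg_events') }}
{% if is_incremental() %}
  -- Strictly greater-than: with append, a >= comparison would
  -- re-insert rows at exactly max(event_ts) and create duplicates.
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```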

2. Configure the model

Here is the annotated shape. Your fact table probably has different column names.

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    on_schema_change='append_new_columns',
    tblproperties={'delta.autoOptimize.optimizeWrite': 'true'}
) }}

with source as (
    select * from {{ ref('stg_orders') }}
),
ranked as (
    select *,
           row_number() over (
               partition by order_id
               order by updated_at desc
           ) as rn
    from source
    {% if is_incremental() %}
      where updated_at >= (
          select coalesce(max(updated_at), '1900-01-01')
          from {{ this }}
      )
    {% endif %}
),
final as ( select * except(rn) from ranked where rn = 1 )

select * from final

Four pieces do real work:

  1. unique_key — what identifies a row uniquely. For an order it is order_id; for an event table, typically event_id.
  2. is_incremental() block — filters the source to rows the target does not have yet. On the first run (table does not exist), this block is skipped and everything loads.
  3. on_schema_change: append_new_columns — if upstream adds a new column, add it to the target rather than failing. Additive-only; type changes still require --full-refresh.
  4. Deduplication via row_number — protects against duplicate-arrival rows from CDC. Without it, merge raises MERGE_CARDINALITY_VIOLATION.

Warning

MERGE_CARDINALITY_VIOLATION at runtime means your source had more than one row per unique_key in the batch being merged. Always deduplicate in the model, even if you believe the source is clean. CDC pipelines re-send rows; late-arriving rows get repeated.
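Databricks SQL also has a QUALIFY clause, which lets you express the same deduplication without a separate CTE or an rn column to drop. A sketch of the model body rewritten with it, using the same assumed columns:

```sql
select *
from {{ ref('stg_orders') }}
{% if is_incremental() %}
  where updated_at >= (
      select coalesce(max(updated_at), '1900-01-01')
      from {{ this }}
  )
{% endif %}
-- Keep only the latest row per order_id; QUALIFY filters on the
-- window function directly, after the where clause is applied.
qualify row_number() over (
    partition by order_id
    order by updated_at desc
) = 1
```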

3. Scope is_incremental() aggressively

The single most important line is the where clause inside is_incremental(). If it is loose, the model reads the entire source on every run and you have paid for incremental complexity with no speedup.

Danger

where updated_at > now() - interval 7 days looks reasonable and is silently wrong. If a prod run is delayed by eight days, you lose rows forever. Always anchor the predicate to a value already in {{ this }}: max(updated_at) from {{ this }}. That way missed runs automatically backfill on the next invocation.
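If late-arriving rows are a concern, you can widen the anchored predicate with a small overlap buffer; because merge plus deduplication is idempotent, re-reading a day of already-loaded rows is harmless. A sketch, where the one-day buffer is an arbitrary choice you should size to your source's actual lateness:

```sql
{% if is_incremental() %}
  where updated_at >= (
      -- Anchor to the target's own high-water mark, minus an
      -- overlap buffer for rows that arrive out of order.
      select coalesce(max(updated_at), timestamp'1900-01-01') - interval 1 day
      from {{ this }}
  )
{% endif %}
```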

4. First run

With an empty target, dbt builds the whole table. This takes as long as a table full refresh did. Subsequent runs only touch new rows.

dbt build --select fct_orders

Verify the model ran as incremental on the second run:

dbt run --select fct_orders
# Look for: "1 of 1 OK incremental model dev.marts.fct_orders"

5. When to --full-refresh

Incremental models accumulate drift. Rebuild them from scratch when:

  1. The model's logic changes, or a column's type changes (on_schema_change only handles additive columns).
  2. Upstream removed a column the target still expects.
  3. You suspect the incremental result has drifted from what a from-scratch build would produce.

dbt run --select fct_orders --full-refresh

Warning

Full refresh rewrites the whole table. On a ten-billion-row fact table that is an expensive operation. Run during off-peak if possible, or use replace_where with a targeted predicate so you only rewrite the affected slice.
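A sketch of the replace_where variant, assuming an order_date column that scopes the slice; dbt-databricks takes the predicate via the incremental_predicates config (both the column name and the three-day window are illustrative, and you should confirm your adapter version supports this config):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='replace_where',
    -- Only the slice matched by this predicate is rewritten.
    incremental_predicates=["order_date >= current_date() - interval 3 days"]
) }}

select *
from {{ ref('stg_orders') }}
{% if is_incremental() %}
  -- Read only the slice being rewritten.
  where order_date >= current_date() - interval 3 days
{% endif %}
```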

6. Add tests

An incremental model needs a unique test on the unique key and a not_null test on everything declared not-null. Declare them in the model's yml file:

version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: updated_at
        tests:
          - not_null

dbt build runs these automatically after the model compiles. A test failure halts downstream models, which is almost always what you want.
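Uniqueness and not-null do not catch a stalled watermark: if upstream silently stops delivering, the model "succeeds" with zero new rows. If you use the dbt_utils package, its recency test flags that; a sketch, where the one-day threshold is an assumption about your freshness SLA:

```yml
models:
  - name: fct_orders
    tests:
      - dbt_utils.recency:
          datepart: day
          field: updated_at
          interval: 1
```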

Common mistakes

| Symptom | Root cause |
| --- | --- |
| Incremental is slower than table | is_incremental() filter is too loose; the model reads the full source anyway. |
| MERGE_CARDINALITY_VIOLATION | Source has duplicates for the unique_key. Deduplicate before merge. |
| Missing rows after a missed run | is_incremental() filter was anchored to wall-clock time instead of max(updated_at) from {{ this }}. |
| Schema drift error | Upstream added a column; on_schema_change is fail or ignore. |
| DELTA_MISSING_COLUMN | Upstream removed a column the target still expects. Needs --full-refresh with the updated model. |

See also