You built a fact table as `materialized: table`. Every night it rebuilds the full history, takes twenty minutes, and costs more than it should. This guide walks through converting it to an incremental model on Databricks, with the decisions you have to make at each step.
Before you start
Incremental models are a step up in complexity. Four things must be true before it is worth the cost:
- The table is large. Rule of thumb: a full refresh takes longer than ten minutes, or the underlying source has more than a hundred million rows.
- The source has a reliable watermark column (a monotonic timestamp like `updated_at` or `loaded_at`).
- You can answer "what is a unique row?" confidently. That answer becomes your `unique_key`.
- You are willing to run `--full-refresh` occasionally. Incremental models are not a silver bullet; they accumulate drift.
Note
If any of those four are not true, stick with `table` for now. A correct full-refresh model beats a subtly broken incremental one every time.
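Before committing, it is worth checking the watermark and key assumptions directly. A hedged pre-flight query (the `stg_orders`, `order_id`, and `updated_at` names are assumptions; substitute your own staging source):

```sql
-- Pre-flight sanity check on the staging source:
-- a reliable watermark has no nulls, and the candidate
-- unique key should show zero duplicates.
select
    count_if(updated_at is null)        as null_watermarks,
    count(*) - count(distinct order_id) as duplicate_keys,
    max(updated_at)                     as latest_watermark
from stg_orders
```

If `null_watermarks` or `duplicate_keys` comes back nonzero, resolve that in staging first; it will surface as missing rows or merge errors later.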
1. Pick the incremental strategy
dbt-databricks supports several strategies. The four below cover 95% of cases.
| Strategy | Use when | Cost |
|---|---|---|
| `merge` | The source has updates to existing rows (SCD-1, upserts). | Highest write cost. |
| `append` | Events are immutable once written. Pure inserts. | Cheapest. |
| `insert_overwrite` | You reprocess whole partitions, often "today". | Low for the touched partition. |
| `replace_where` | You rewrite a predicate-scoped slice like "last 3 days". | Low; flexible. |
Start with `merge` if you have updates, `append` if you do not. You can always switch later.
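For contrast with the `merge` model built in the next section, here is what the `append` case reduces to. A sketch, with assumed names (`fct_events`, `stg_events`, `loaded_at`); note there is no `unique_key` because immutable events never need updating in place:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='append'   -- pure inserts; no unique_key needed
) }}

select * from {{ ref('stg_events') }}
{% if is_incremental() %}
-- strictly greater-than: an already-loaded event is never re-inserted
where loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}
```

The strict `>` is safe here only because events are immutable; the `merge` model below deliberately uses `>=` plus deduplication instead.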
2. Configure the model
Here is the annotated shape. Your fact table probably has different column names.
```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    on_schema_change='append_new_columns',
    tblproperties={'delta.autoOptimize.optimizeWrite': 'true'}
) }}

with source as (
    select * from {{ ref('stg_orders') }}
),

ranked as (
    select *,
        row_number() over (
            partition by order_id
            order by updated_at desc
        ) as rn
    from source
    {% if is_incremental() %}
    where updated_at >= (
        select coalesce(max(updated_at), '1900-01-01')
        from {{ this }}
    )
    {% endif %}
),

final as (
    select * except(rn) from ranked where rn = 1
)

select * from final
```
Four pieces do real work:
- `unique_key` — what identifies a row uniquely. For an order it is `order_id`; for an event table, typically `event_id`.
- The `is_incremental()` block — filters the source to rows the target does not have yet. On the first run (the table does not exist), this block is skipped and everything loads.
- `on_schema_change: append_new_columns` — if upstream adds a new column, add it to the target rather than failing. Additive-only; type changes still require `--full-refresh`.
- Deduplication via `row_number` — protects against duplicate-arriving rows from CDC. Without it, `merge` raises `MERGE_CARDINALITY_VIOLATION`.
Warning
`MERGE_CARDINALITY_VIOLATION` at runtime means your source had more than one row per `unique_key` in the batch being merged. Always deduplicate in the model, even if you believe the source is clean: CDC pipelines re-send rows, and late-arriving rows get repeated.
3. Scope is_incremental() aggressively
The single most important line is the where clause inside the `is_incremental()` block. If it is loose, the model reads the entire source on every run, and you have bought incremental complexity for no speedup.
Danger
`where updated_at > now() - interval 7 days` looks reasonable and is silently wrong. If a prod run is delayed by eight days, you lose rows forever. Always anchor the predicate to a value already in the target: `max(updated_at)` from `{{ this }}`. That way missed runs automatically backfill on the next invocation.
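One refinement on top of the `{{ this }}` anchor: if the source can deliver rows out of order (common with CDC), a fixed lookback buffer re-reads a small recent window, and the `row_number` dedup in the model absorbs the overlap. A sketch; the two-day buffer is an assumption to tune against your pipeline's observed lateness:

```sql
{% if is_incremental() %}
where updated_at >= (
    -- anchor to the target, then step back a fixed buffer to catch
    -- out-of-order arrivals; the row_number dedup handles the overlap
    select coalesce(max(updated_at), '1900-01-01') - interval 2 days
    from {{ this }}
)
{% endif %}
```

The buffer trades a bounded amount of re-reading for resilience; unlike a wall-clock predicate, it is still anchored to `{{ this }}`, so missed runs backfill correctly.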
4. First run
With an empty target, dbt builds the whole table. This takes as long as a `table` full refresh did. Subsequent runs only touch new rows.
```shell
dbt build --select fct_orders
```
Verify the model ran as incremental on the second run:
```shell
dbt run --select fct_orders
# Look for: "1 of 1 OK incremental model dev.marts.fct_orders"
```
5. When to --full-refresh
Incremental models accumulate drift. Rebuild them from scratch when:
- You change the column list, change a column type, or change business logic.
- Upstream source data was corrected for historical dates.
- `on_schema_change: fail` triggered; you understood the change, and now you want to apply it.
```shell
dbt run --select fct_orders --full-refresh
```
Warning
Full refresh rewrites the whole table. On a ten-billion-row fact table that is an expensive operation. Run during off-peak if possible, or use `replace_where` with a targeted predicate so you only rewrite the affected slice.
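A sketch of that targeted-slice approach, assuming an `order_date` partition column and a corrected window; check the dbt-databricks adapter docs for the exact `incremental_predicates` shape your version expects before relying on it:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='replace_where',
    -- only rows matching this predicate are deleted and rewritten;
    -- the model's select must produce the full replacement slice
    incremental_predicates=["order_date >= '2024-01-01'"]
) }}

select * from {{ ref('stg_orders') }}
where order_date >= '2024-01-01'
```

The predicate and the model's own filter must agree, otherwise you delete a slice you do not fully rewrite.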
6. Add tests
An incremental model needs a `unique` test on the unique key and a `not_null` test on every column declared not-null. Declare them in the model's `.yml` file:
```yaml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: updated_at
        tests:
          - not_null
```
`dbt build` runs these automatically after the model builds. A test failure halts downstream models, which is almost always what you want.
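If your `unique_key` is composite (say an order-line grain keyed by `order_id` plus `line_number`), a plain `unique` test on one column will fail by design. The dbt-utils package provides a combination test; a sketch, assuming dbt-utils is installed and using hypothetical names:

```yaml
version: 2

models:
  - name: fct_order_lines
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - line_number
```

The same composite key should then appear as a list in the model's `unique_key` config, so the test guards exactly what the merge keys on.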
Common mistakes
| Symptom | Root cause |
|---|---|
| Incremental is slower than `table` | `is_incremental()` filter is too loose; the model reads the full source anyway. |
| `MERGE_CARDINALITY_VIOLATION` | Source has duplicates for the `unique_key`. Deduplicate before the merge. |
| Missing rows after a missed run | `is_incremental()` filter was anchored to wall-clock time instead of `max(updated_at)` from `{{ this }}`. |
| Schema drift error | Upstream added a column; `on_schema_change` is `fail` or `ignore`. |
| `DELTA_MISSING_COLUMN` | Upstream removed a column the target still expects. Needs `--full-refresh` with the updated model. |
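When `MERGE_CARDINALITY_VIOLATION` fires, an ad-hoc query against the staging source pinpoints the offending keys. A sketch with assumed names (`stg_orders`, `order_id`):

```sql
-- List the keys that appear more than once in the batch;
-- these are the rows the merge could not match unambiguously.
select order_id, count(*) as copies
from stg_orders
group by order_id
having count(*) > 1
order by copies desc
limit 20
```

Inspecting a few of these rows usually reveals whether the cause is CDC re-delivery (fix with the model's dedup) or a genuinely wrong `unique_key` choice.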
See also
- Materializations reference — exhaustive options for each strategy.
- Model authoring standards — Causeway's rules for incremental models.
- Failure triage — what to do when an incremental run fails in prod.