A dbt run failed at 3am. You are on call. This guide tells you what to do in the first five minutes, then the next twenty. It assumes you can read Python tracebacks and have kubectl or a runbook-equivalent handy.
Step 1: Identify the failure (30 seconds)
From the Airflow task log or the dbt CLI output, record four things:
- Model name (for example `fct_revenue`).
- Error class — Compilation, Database, Dependency, Freshness, Test, or Runtime.
- First line of the error message.
- Materialization type — view, table, incremental, materialized view.
Note
The error class is the single most important thing to get right. The recovery path is entirely different for "Jinja typo" versus "warehouse is down" versus "source has a new column". Do not skip classification to save time.
The raw signals:
```shell
# From wherever the dbt command ran
jq '.results[] | select(.status == "error")' target/run_results.json

# Or with the command still open:
dbt run --select tag:daily 2>&1 | tail -30
```
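If you triage often, the same extraction is worth scripting. A minimal sketch, assuming the standard `run_results.json` shape (the sample payload below is trimmed and illustrative):

```python
import json

def failed_results(run_results_text: str) -> list[dict]:
    """Extract the triage signals for every errored node."""
    data = json.loads(run_results_text)
    failures = []
    for result in data["results"]:
        if result["status"] == "error":
            lines = (result.get("message") or "").splitlines()
            failures.append({
                # unique_id looks like "model.<project>.<model_name>"
                "model": result["unique_id"].split(".")[-1],
                # the first line is usually enough to classify
                "first_line": lines[0] if lines else "",
            })
    return failures

# Trimmed, illustrative payload in the shape dbt writes:
sample = ('{"results": [{"unique_id": "model.proj.fct_revenue", '
          '"status": "error", '
          '"message": "Database Error in model fct_revenue\\n  permission denied"}]}')
print(failed_results(sample))
```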
Step 2: Classify
| First-line pattern | Class | Fix-it lives in |
|---|---|---|
| `Compilation Error in model` | Compilation | Your code |
| `Database Error in model` + SQL | Database | Your SQL or your permissions |
| `depends on a node named X which was not found` | Dependency | Your refs or a deleted upstream |
| `Source freshness` failure | Freshness | Upstream data pipeline |
| `Failure in test` | Test | Data quality — see below |
| `Runtime Error` + connection | Runtime | Warehouse / infrastructure |
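The table above is essentially an ordered pattern match. A sketch of the same logic (the pattern strings mirror this table, not dbt's full set of error messages):

```python
PATTERNS = [
    ("Compilation Error in model", "Compilation"),
    ("Database Error in model", "Database"),
    ("depends on a node named", "Dependency"),
    ("Source freshness", "Freshness"),
    ("Failure in test", "Test"),
    ("Runtime Error", "Runtime"),
]

def classify(first_line: str) -> str:
    """Map the first line of the error to a triage class."""
    for pattern, error_class in PATTERNS:
        if pattern in first_line:
            return error_class
    return "Unknown"  # fall back to manual triage

print(classify("Database Error in model fct_revenue"))  # Database
```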
Step 3: Check blast radius (30 seconds)
```shell
# What depends on the failed model?
dbt ls --select <failed_model>+ --output name | wc -l
```
If the number is above 50, the failure is a major incident. Page others before continuing. If the number is 1 or 2, you can keep triaging solo.
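The same decision can be made mechanical. A sketch, where the 50-model threshold is this guide's rule of thumb, not a universal constant:

```python
def blast_radius(dbt_ls_output: str) -> int:
    """Count models in the output of `dbt ls --select <model>+ --output name`."""
    return sum(1 for line in dbt_ls_output.splitlines() if line.strip())

def is_major_incident(downstream_count: int, threshold: int = 50) -> bool:
    """True when the blast radius crosses the paging threshold."""
    return downstream_count > threshold

print(is_major_incident(blast_radius("fct_revenue\nrpt_weekly_revenue\n")))  # False
```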
Step 4: Pick the recovery path
4a. Compilation error
Tip
These are almost always your own bug and almost always fast to fix. Run dbt compile --select <model> locally, read the expanded Jinja error, patch, re-push.
4b. Database error
Three sub-cases cover 90%:
Permission denied. Missing GRANT on the source or target. Fix in Unity Catalog:
```sql
GRANT SELECT ON TABLE prod.bronze.raw_events TO `dbt-service-principal`;
GRANT CREATE TABLE ON SCHEMA prod.silver TO `dbt-service-principal`;
```
Statement timeout. The warehouse cancelled the query.
Warning
Temptation: bump warehouse size and re-run. Resist. A timeout usually means a missing partition filter or an unbounded join, and a bigger warehouse just delays the real fix. Profile the compiled SQL first.
Databricks-specific error. Most common on incremental models:
| Error | Cause | Fix |
|---|---|---|
| `DELTA_MISSING_COLUMN` | Source column referenced no longer exists | Check the compiled SQL; fix the model; `--full-refresh` if needed |
| `MERGE_CARDINALITY_VIOLATION` | `unique_key` is not actually unique in the batch | Add a deduplication CTE before the merge |
| `SCHEMA_CHANGE_NOT_ALLOWED` | Target schema differs from the model's output | Set `on_schema_change: 'sync_all_columns'` or `--full-refresh` |
| `WAREHOUSE_NOT_RUNNING` | SQL Warehouse stopped | Start it; check auto-suspend settings |
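The `MERGE_CARDINALITY_VIOLATION` fix, deduplicating on the `unique_key` before the merge, is easiest to see outside SQL. A sketch of "keep the latest row per key" (column names are hypothetical; in the model itself this becomes a `ROW_NUMBER`-style CTE):

```python
def dedupe_latest(rows: list[dict], key: str, order_by: str) -> list[dict]:
    """Keep one row per key: the one with the greatest order_by value."""
    best: dict = {}
    for row in rows:
        k = row[key]
        if k not in best or row[order_by] > best[k][order_by]:
            best[k] = row
    return list(best.values())

batch = [
    {"order_id": 1, "updated_at": "2024-01-01", "amount": 10},
    {"order_id": 1, "updated_at": "2024-01-02", "amount": 12},  # duplicate key
    {"order_id": 2, "updated_at": "2024-01-01", "amount": 7},
]
print(dedupe_latest(batch, "order_id", "updated_at"))
```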
4c. Test failure
```shell
# Capture the offending rows
dbt test --select <model> --store-failures
```
Then inspect the stored failures:
```sql
SELECT * FROM prod.dbt_test_audit.<test_name> ORDER BY _loaded_at DESC LIMIT 100;
```
By test type:
| Test | If failing, it means | Next action |
|---|---|---|
| `not_null` | Source has nulls in a column you declared not-null | Either the data is bad (fix upstream) or the schema is wrong (relax the test) |
| `unique` | Duplicates in what should be a key | Check for join fanout; deduplicate; verify upstream |
| `accepted_values` | Value outside the known set | Source added a new enum value; update the accepted list or your handling |
| `relationships` | Broken referential integrity | Ordering issue (dim loaded after fact) or an actual FK violation |
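For the first two rows of the table, the stored failures can be sanity-checked locally before you touch anything upstream. A sketch over a sample of rows (column names are hypothetical):

```python
from collections import Counter

def null_violations(rows: list[dict], column: str) -> list[dict]:
    """Rows that would fail a not_null test on `column`."""
    return [row for row in rows if row.get(column) is None]

def duplicate_keys(rows: list[dict], key: str) -> dict:
    """Key values that would fail a unique test, with their counts."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}

sample_rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": None},  # duplicate id and a null email
    {"id": 2, "email": "b@example.com"},
]
print(null_violations(sample_rows, "email"), duplicate_keys(sample_rows, "id"))
```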
Danger
Never "temporarily" disable a failing test to unblock prod without filing a data quality ticket. The test exists because someone once got burned; disabling it invites the same burn again.
4d. Incremental model failure
Decide whether to --full-refresh:
| Scenario | Full refresh needed? |
|---|---|
| New additive column upstream | No — append_new_columns handles it |
| Column type changed | Yes |
| Column removed upstream | Yes |
| Incremental logic fix | Yes |
| Historical source data corrected | Yes, or targeted replace_where backfill |
| Normal daily run just failed once | No — retry |
```shell
dbt run --select <model> --full-refresh
```
Warning
--full-refresh rewrites the entire table. On a ten-billion-row fact this is expensive. Schedule off-peak or scope the rewrite with replace_where.
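The decision table reduces to a lookup you can keep in a triage script. A sketch (the scenario labels are this guide's shorthand, not dbt terminology):

```python
FULL_REFRESH_NEEDED = {
    "new_additive_column": False,       # append_new_columns handles it
    "column_type_changed": True,
    "column_removed_upstream": True,
    "incremental_logic_fix": True,
    "historical_data_corrected": True,  # or a targeted replace_where backfill
    "transient_failure": False,         # just retry
}

def needs_full_refresh(scenario: str) -> bool:
    """Look up the full-refresh decision for a known scenario."""
    return FULL_REFRESH_NEEDED[scenario]

print(needs_full_refresh("column_type_changed"))  # True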
4e. Source freshness
```shell
dbt source freshness --select source:bronze.raw_transactions
```
If the upstream ingestion is late, fix the pipeline, not dbt. The freshness failure is the correct signal; treating it as a dbt problem just hides the real issue.
Step 5: Re-run efficiently
```shell
# Just the failed model
dbt run --select <model>

# Failed model + its downstream (most common)
dbt run --select <model>+

# Only models in error state from the last run
dbt run --select result:error --state ./target/

# State-based: modified models against prod
dbt run --select state:modified+ --defer --state ./prod-manifest/
```
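The recommended `<model>+` form can be generated straight from the artifacts. A sketch, assuming the standard `run_results.json` shape:

```python
import json

def retry_command(run_results_text: str) -> str:
    """Build `dbt run --select <model>+ ...` for every errored node."""
    data = json.loads(run_results_text)
    failed = [r["unique_id"].split(".")[-1]
              for r in data["results"] if r["status"] == "error"]
    if not failed:
        return ""
    return "dbt run --select " + " ".join(f"{model}+" for model in failed)

sample = '{"results": [{"unique_id": "model.proj.fct_revenue", "status": "error"}]}'
print(retry_command(sample))  # dbt run --select fct_revenue+
```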
Important
Always prefer <model>+ over a blanket dbt run. Rebuilding unrelated models during a recovery adds cost, hides the actual fix, and occasionally makes things worse if another unrelated model has a silent issue.
Step 6: Verify recovery
Before you declare the incident over:
- All previously failed models show `success` in the latest run.
- All tests pass on the recovered models.
- Row counts are reasonable — not zero, not doubled.
- Source freshness is within SLA.
- Downstream consumers have been notified of any data delay.
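The row-count check is the easiest one to automate. A sketch, where the 2x upper bound is an assumption to tune per model:

```python
def row_count_reasonable(today: int, yesterday: int, max_ratio: float = 2.0) -> bool:
    """Flag empty rebuilds and suspicious doublings after a recovery run."""
    if today == 0:
        return False  # the rebuild produced nothing
    if yesterday > 0 and today >= yesterday * max_ratio:
        return False  # likely a fanout or double-load
    return True

print(row_count_reasonable(1_050_000, 1_000_000))  # True
print(row_count_reasonable(0, 1_000_000))          # False
```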
Triage template
Fill this out every time:
```text
Date/Time: _______
Model: _______
Materialization: [ ] view [ ] table [ ] incremental [ ] MV
Error class: [ ] Compile [ ] DB [ ] Dep [ ] Fresh [ ] Test [ ] Runtime
First-line error: _______
Blast radius: ___ downstream models
Resolution: [ ] Code fix [ ] Full refresh [ ] Permissions [ ] Upstream [ ] Escalate
Resolution detail: _______
Recovery command: _______
```
Attach the completed template to the incident record. Future you (or the next on-call) will thank you.
See also
- Common errors reference — symptom lookup.
- Incremental models guide — the patterns that cause the most 3am pages.
- Production readiness — build the things that make future triage faster.