A dbt run failed at 3am. You are on call. This guide tells you what to do in the first five minutes, then the next twenty. It assumes you can read Python tracebacks and have kubectl or a runbook-equivalent handy.
Step 1: Identify the failure (30 seconds)
From the Airflow task log or the dbt CLI output, record four things:
- Model name (for example `fct_revenue`).
- Error class — Compilation, Database, Dependency, Freshness, Test, or Runtime.
- First line of the error message.
- Materialization type — view, table, incremental, materialized view.
Note
The error class is the single most important thing to get right. The recovery path is entirely different for "Jinja typo" versus "warehouse is down" versus "source has a new column". Do not skip classification to save time.
The raw signals:
```shell
# From wherever the dbt command ran
jq '.results[] | select(.status == "error")' target/run_results.json

# Or with the command still open:
dbt run --select tag:daily 2>&1 | tail -30
```
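If you triage often, the same extraction is worth scripting. A minimal sketch, assuming the standard `run_results.json` shape (the sample payload below is trimmed and illustrative):

```python
import json

def failed_results(run_results_text: str) -> list[dict]:
    """Extract the triage signals for every errored node."""
    data = json.loads(run_results_text)
    failures = []
    for result in data["results"]:
        if result["status"] == "error":
            lines = (result.get("message") or "").splitlines()
            failures.append({
                # unique_id looks like "model.<project>.<model_name>"
                "model": result["unique_id"].split(".")[-1],
                # the first line is usually enough to classify
                "first_line": lines[0] if lines else "",
            })
    return failures

# Trimmed, illustrative payload in the shape dbt writes:
sample = ('{"results": [{"unique_id": "model.proj.fct_revenue", '
          '"status": "error", '
          '"message": "Database Error in model fct_revenue\\n  permission denied"}]}')
print(failed_results(sample))
```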
Step 2: Classify
| First-line pattern | Class | Fix-it lives in |
|---|---|---|
| `Compilation Error in model` | Compilation | Your code |
| `Database Error in model` + SQL | Database | Your SQL or your permissions |
| `depends on a node named X which was not found` | Dependency | Your refs or a deleted upstream |
| `Source freshness` failure | Freshness | Upstream data pipeline |
| `Failure in test` | Test | Data quality — see below |
| `Runtime Error` + connection | Runtime | Warehouse / infrastructure |
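The table above is essentially an ordered pattern match. A sketch of the same logic (the pattern strings mirror this table, not dbt's full set of error messages):

```python
PATTERNS = [
    ("Compilation Error in model", "Compilation"),
    ("Database Error in model", "Database"),
    ("depends on a node named", "Dependency"),
    ("Source freshness", "Freshness"),
    ("Failure in test", "Test"),
    ("Runtime Error", "Runtime"),
]

def classify(first_line: str) -> str:
    """Map the first line of the error to a triage class."""
    for pattern, error_class in PATTERNS:
        if pattern in first_line:
            return error_class
    return "Unknown"  # fall back to manual triage

print(classify("Database Error in model fct_revenue"))  # Database
```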
Step 3: Check blast radius (30 seconds)
```shell
# What depends on the failed model?
dbt ls --select <failed_model>+ --output name | wc -l
```
If the number is above 50, the failure is a major incident. Page others before continuing. If the number is 1 or 2, you can keep triaging solo.
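The same decision can be made mechanical. A sketch, where the 50-model threshold is this guide's rule of thumb, not a universal constant:

```python
def blast_radius(dbt_ls_output: str) -> int:
    """Count models in the output of `dbt ls --select <model>+ --output name`."""
    return sum(1 for line in dbt_ls_output.splitlines() if line.strip())

def is_major_incident(downstream_count: int, threshold: int = 50) -> bool:
    """True when the blast radius crosses the paging threshold."""
    return downstream_count > threshold

print(is_major_incident(blast_radius("fct_revenue\nrpt_weekly_revenue\n")))  # False
```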
Step 4: Pick the recovery path
4a. Compilation error
Tip
These are almost always your own bug and almost always fast to fix. Run dbt compile --select <model> locally, read the expanded Jinja error, patch, re-push.
4b. Database error
Three sub-cases cover 90%:
Permission denied. Missing GRANT on the source or target. Fix in Unity Catalog:
```sql
GRANT SELECT ON TABLE prod.bronze.raw_events TO `dbt-service-principal`;
GRANT CREATE TABLE ON SCHEMA prod.silver TO `dbt-service-principal`;
```
Statement timeout. The warehouse cancelled the query.
Warning
Temptation: bump warehouse size and re-run. Resist. A timeout usually means a missing partition filter or an unbounded join, and a bigger warehouse just delays the real fix. Profile the compiled SQL first.
Databricks-specific error. Most common on incremental models:
| Error | Cause | Fix |
|---|---|---|
| `DELTA_MISSING_COLUMN` | Source column referenced no longer exists | Check the compiled SQL; fix the model; `--full-refresh` if needed |
| `MERGE_CARDINALITY_VIOLATION` | `unique_key` is not actually unique in the batch | Add a deduplication CTE before the merge |
| `SCHEMA_CHANGE_NOT_ALLOWED` | Target schema differs from the model's output | Set `on_schema_change: 'sync_all_columns'` or `--full-refresh` |
| `WAREHOUSE_NOT_RUNNING` | SQL Warehouse stopped | Start it; check auto-suspend settings |
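The `MERGE_CARDINALITY_VIOLATION` fix, deduplicating on the `unique_key` before the merge, is easiest to see outside SQL. A sketch of "keep the latest row per key" (column names are hypothetical; in the model itself this becomes a `ROW_NUMBER`-style CTE):

```python
def dedupe_latest(rows: list[dict], key: str, order_by: str) -> list[dict]:
    """Keep one row per key: the one with the greatest order_by value."""
    best: dict = {}
    for row in rows:
        k = row[key]
        if k not in best or row[order_by] > best[k][order_by]:
            best[k] = row
    return list(best.values())

batch = [
    {"order_id": 1, "updated_at": "2024-01-01", "amount": 10},
    {"order_id": 1, "updated_at": "2024-01-02", "amount": 12},  # duplicate key
    {"order_id": 2, "updated_at": "2024-01-01", "amount": 7},
]
print(dedupe_latest(batch, "order_id", "updated_at"))
```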
4c. Test failure
```shell
# Capture the offending rows
dbt test --select <model> --store-failures
```
Then inspect the stored failures:
```sql
SELECT * FROM prod.dbt_test_audit.<test_name> ORDER BY _loaded_at DESC LIMIT 100;
```
By test type:
| Test | If failing, it means | Next action |
|---|---|---|
| `not_null` | Source has nulls in a column you declared not-null | Either the data is bad (fix upstream) or the schema is wrong (relax the test) |
| `unique` | Duplicates in what should be a key | Check for join fanout; deduplicate; verify upstream |
| `accepted_values` | Value outside the known set | Source added a new enum value; update the accepted list or your handling |
| `relationships` | Broken referential integrity | Ordering issue (dim loaded after fact) or an actual FK violation |
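For the first two rows of the table, the stored failures can be sanity-checked locally before you touch anything upstream. A sketch over a sample of rows (column names are hypothetical):

```python
from collections import Counter

def null_violations(rows: list[dict], column: str) -> list[dict]:
    """Rows that would fail a not_null test on `column`."""
    return [row for row in rows if row.get(column) is None]

def duplicate_keys(rows: list[dict], key: str) -> dict:
    """Key values that would fail a unique test, with their counts."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}

sample_rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": None},  # duplicate id and a null email
    {"id": 2, "email": "b@example.com"},
]
print(null_violations(sample_rows, "email"), duplicate_keys(sample_rows, "id"))
```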
Danger
Never "temporarily" disable a failing test to unblock prod without filing a data quality ticket. The test exists because someone once got burned; disabling it invites the same burn again.
4d. Incremental model failure
Decide whether to --full-refresh:
| Scenario | Full refresh needed? |
|---|---|
| New additive column upstream | No — append_new_columns handles it |
| Column type changed | Yes |
| Column removed upstream | Yes |
| Incremental logic fix | Yes |
| Historical source data corrected | Yes, or targeted replace_where backfill |
| Normal daily run just failed once | No — retry |
```shell
dbt run --select <model> --full-refresh
```
Warning
--full-refresh rewrites the entire table. On a ten-billion-row fact this is expensive. Schedule off-peak or scope the rewrite with replace_where.
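The decision table reduces to a lookup you can keep in a triage script. A sketch (the scenario labels are this guide's shorthand, not dbt terminology):

```python
FULL_REFRESH_NEEDED = {
    "new_additive_column": False,       # append_new_columns handles it
    "column_type_changed": True,
    "column_removed_upstream": True,
    "incremental_logic_fix": True,
    "historical_data_corrected": True,  # or a targeted replace_where backfill
    "transient_failure": False,         # just retry
}

def needs_full_refresh(scenario: str) -> bool:
    """Look up the full-refresh decision for a known scenario."""
    return FULL_REFRESH_NEEDED[scenario]

print(needs_full_refresh("column_type_changed"))  # True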
4e. Source freshness
```shell
dbt source freshness --select source:bronze.raw_transactions
```
If the upstream ingestion is late, fix the pipeline, not dbt. The freshness failure is the correct signal; treating it as a dbt problem just hides the real issue.
Step 5: Re-run efficiently
```shell
# Just the failed model
dbt run --select <model>

# Failed model + its downstream (most common)
dbt run --select <model>+

# Only models in error state from the last run
dbt run --select result:error --state ./target/

# State-based: modified models against prod
dbt run --select state:modified+ --defer --state ./prod-manifest/
```
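The recommended `<model>+` form can be generated straight from the artifacts. A sketch, assuming the standard `run_results.json` shape:

```python
import json

def retry_command(run_results_text: str) -> str:
    """Build `dbt run --select <model>+ ...` for every errored node."""
    data = json.loads(run_results_text)
    failed = [r["unique_id"].split(".")[-1]
              for r in data["results"] if r["status"] == "error"]
    if not failed:
        return ""
    return "dbt run --select " + " ".join(f"{model}+" for model in failed)

sample = '{"results": [{"unique_id": "model.proj.fct_revenue", "status": "error"}]}'
print(retry_command(sample))  # dbt run --select fct_revenue+
```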
Important
Always prefer <model>+ over a blanket dbt run. Rebuilding unrelated models during a recovery adds cost, hides the actual fix, and occasionally makes things worse if another unrelated model has a silent issue.
Step 6: Verify recovery
Before you declare the incident over:
- All previously failed models show `success` in the latest run.
- All tests pass on the recovered models.
- Row counts are reasonable — not zero, not doubled.
- Source freshness is within SLA.
- Downstream consumers have been notified of any data delay.
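The row-count check is the easiest one to automate. A sketch, where the 2x upper bound is an assumption to tune per model:

```python
def row_count_reasonable(today: int, yesterday: int, max_ratio: float = 2.0) -> bool:
    """Flag empty rebuilds and suspicious doublings after a recovery run."""
    if today == 0:
        return False  # the rebuild produced nothing
    if yesterday > 0 and today >= yesterday * max_ratio:
        return False  # likely a fanout or double-load
    return True

print(row_count_reasonable(1_050_000, 1_000_000))  # True
print(row_count_reasonable(0, 1_000_000))          # False
```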
Triage template
Fill this out every time:
```text
Date/Time: _______
Model: _______
Materialization: [ ] view [ ] table [ ] incremental [ ] MV
Error class: [ ] Compile [ ] DB [ ] Dep [ ] Fresh [ ] Test [ ] Runtime
First-line error: _______
Blast radius: ___ downstream models
Resolution: [ ] Code fix [ ] Full refresh [ ] Permissions [ ] Upstream [ ] Escalate
Resolution detail: _______
Recovery command: _______
```
Attach the completed template to the incident record. Future you (or the next on-call) will thank you.
See also
- Common errors reference — symptom lookup.
- Incremental models guide — the patterns that cause the most 3am pages.
- Production readiness — build the things that make future triage faster.