A dbt run failed at 3am. You are on call. This guide tells you what to do in the first five minutes, then the next twenty. It assumes you can read Python tracebacks and have kubectl or a runbook-equivalent handy.

Step 1: Identify the failure (30 seconds)

From the Airflow task log or the dbt CLI output, record four things: the failing model, its materialization, the error class (Step 2), and the first line of the error message.

Note

The error class is the single most important thing to get right. The recovery path is entirely different for "Jinja typo" versus "warehouse is down" versus "source has a new column". Do not skip classification to save time.

The raw signals:

```shell
# From wherever the dbt command ran
jq '.results[] | select(.status == "error")' target/run_results.json

# Or with the command still open:
dbt run --select tag:daily 2>&1 | tail -30
```

Step 2: Classify

| First-line pattern | Class | Fix-it lives in |
| --- | --- | --- |
| Compilation Error in model | Compilation | Your code |
| Database Error in model + SQL | Database | Your SQL or your permissions |
| depends on a node named X which was not found | Dependency | Your refs or a deleted upstream |
| Source freshness failure | Freshness | Upstream data pipeline |
| Failure in test | Test | Data quality (see below) |
| Runtime Error + connection | Runtime | Warehouse / infrastructure |
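The classification can be scripted so it survives 3am judgment. A minimal sketch; `classify_dbt_error` is a hypothetical helper, and the patterns mirror the table:

```shell
# Hypothetical helper: map the first line of a dbt error to its class.
# "unknown" means go read the full log before acting.
classify_dbt_error() {
  case "$1" in
    *"Compilation Error"*)  echo "compilation" ;;
    *"Database Error"*)     echo "database" ;;
    *"was not found"*)      echo "dependency" ;;
    *"freshness"*)          echo "freshness" ;;
    *"Failure in test"*)    echo "test" ;;
    *"Runtime Error"*)      echo "runtime" ;;
    *)                      echo "unknown" ;;
  esac
}

classify_dbt_error "Compilation Error in model stg_orders"   # compilation
```

The ordering of the patterns matters: check the most specific strings first if you extend it.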

Step 3: Check blast radius (30 seconds)

```shell
# What depends on the failed model?
dbt ls --select <failed_model>+ --output name | wc -l
```

If the number is above 50, the failure is a major incident. Page others before continuing. If the number is 1 or 2, you can keep triaging solo.

Step 4: Pick the recovery path

4a. Compilation error

Tip

These are almost always your own bug and almost always fast to fix. Run dbt compile --select <model> locally, read the expanded Jinja error, patch, re-push.

4b. Database error

Three sub-cases cover roughly 90% of database errors:

Permission denied. Missing GRANT on the source or target. Fix in Unity Catalog:

```sql
GRANT SELECT ON TABLE prod.bronze.raw_events TO `dbt-service-principal`;
GRANT CREATE TABLE ON SCHEMA prod.silver TO `dbt-service-principal`;
```
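Before adding grants, it is worth confirming what the principal actually holds; Unity Catalog exposes this directly (the table and schema names reuse the examples above):

```sql
-- What grants already exist on the source and target?
SHOW GRANTS ON TABLE prod.bronze.raw_events;
SHOW GRANTS ON SCHEMA prod.silver;
```

If the grant is already there, the problem is usually the principal the job runs as, not the grant itself.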

Statement timeout. The warehouse cancelled the query.

Warning

Temptation: bump warehouse size and re-run. Resist. A timeout usually means a missing partition filter or an unbounded join, and a bigger warehouse just delays the real fix. Profile the compiled SQL first.
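Profiling starts with the compiled SQL dbt actually sent. A sketch for locating it; `fct_orders` is a stand-in model name, and the `target/compiled` path layout is dbt's default:

```shell
# Stand-in model name; target/compiled/<project>/models/... is dbt's default layout
MODEL=fct_orders
COMPILED=$(find target/compiled -name "${MODEL}.sql" 2>/dev/null | head -n 1)
echo "compiled SQL: ${COMPILED:-not found}"
# Run EXPLAIN on that file's contents in the warehouse and look for full scans
# or exploding joins before touching the warehouse size.
```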

Databricks-specific error. Most common on incremental models:

| Error | Cause | Fix |
| --- | --- | --- |
| DELTA_MISSING_COLUMN | Source column referenced no longer exists | Check the compiled SQL; fix the model; --full-refresh if needed |
| MERGE_CARDINALITY_VIOLATION | unique_key is not actually unique in the batch | Add a deduplication CTE before merge |
| SCHEMA_CHANGE_NOT_ALLOWED | Target schema differs from model's output | Set on_schema_change: 'sync_all_columns' or --full-refresh |
| WAREHOUSE_NOT_RUNNING | SQL Warehouse stopped | Start it; check auto-suspend |
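For MERGE_CARDINALITY_VIOLATION, the deduplication CTE looks like this sketch; `order_id` (the unique_key), `_loaded_at`, and the ref are stand-ins, and `* except (...)` is Databricks SQL syntax:

```sql
with ranked as (
    select
        *,
        row_number() over (
            partition by order_id        -- the model's unique_key (stand-in)
            order by _loaded_at desc     -- keep the newest row per key
        ) as rn
    from {{ ref('stg_orders') }}
)

select * except (rn)
from ranked
where rn = 1
```

Keep the window's ordering column deterministic; ordering by a non-unique timestamp alone can still flap between runs.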

4c. Test failure

```shell
# Capture the offending rows
dbt test --select <model> --store-failures
```

Then inspect:

```sql
SELECT * FROM prod.dbt_test_audit.<test_name> ORDER BY _loaded_at DESC LIMIT 100;
```

By test type:

| Test | If failing, it means | Next action |
| --- | --- | --- |
| not_null | Source has nulls in a column you declared not-null | Either the data is bad (fix upstream) or the schema is wrong (relax the test) |
| unique | Duplicates in what should be a key | Check for join fanout; deduplicate; verify upstream |
| accepted_values | Value outside the known set | Source added a new enum; update the accepted list or your handling |
| relationships | Broken referential integrity | Ordering issue (dim loaded after fact) or actual FK violation |
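For a failing unique test, the fanout check starts with a duplicate count. A sketch; the table and key names are stand-ins:

```sql
-- Which keys are duplicated, and how badly? (names are stand-ins)
select order_id, count(*) as copies
from prod.silver.fct_orders
group by order_id
having count(*) > 1
order by copies desc
limit 20;
```

A handful of keys with two copies each suggests a late-arriving update; every key at exactly N copies suggests a join fanout against a table with N matching rows.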

Danger

Never "temporarily" disable a failing test to unblock prod without filing a data quality ticket. The test exists because someone once got burned, and disabling it invites the same burn again.

4d. Incremental model failure

Decide whether to --full-refresh:

| Scenario | Full refresh needed? |
| --- | --- |
| New additive column upstream | No; append_new_columns handles it |
| Column type changed | Yes |
| Column removed upstream | Yes |
| Incremental logic fix | Yes |
| Historical source data corrected | Yes, or targeted replace_where backfill |
| Normal daily run just failed once | No; retry |

```shell
dbt run --select <model> --full-refresh
```

Warning

--full-refresh rewrites the entire table. On a ten-billion-row fact this is expensive. Schedule off-peak or scope the rewrite with replace_where.
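A scoped rewrite sketch, assuming the dbt-databricks adapter's replace_where incremental strategy with incremental_predicates (check your adapter version supports it; the event_date column and cutoff are stand-ins):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='replace_where',
    -- rewrite only the affected range instead of the whole table
    -- (event_date and the cutoff are stand-ins for your column and range)
    incremental_predicates=["event_date >= '2024-06-01'"]
) }}

select *
from {{ ref('stg_events') }}
{% if is_incremental() %}
where event_date >= '2024-06-01'
{% endif %}
```

The predicate and the WHERE clause must describe the same slice, or you will delete rows you did not re-select.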

4e. Source freshness

```shell
dbt source freshness --select source:bronze.raw_transactions
```

If the upstream ingestion is late, fix the pipeline, not dbt. The freshness failure is the correct signal; treating it as a dbt problem just hides the real issue.
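To quantify how late the source actually is, one query against the source table is enough; `_loaded_at` is a stand-in for your loader's timestamp column, and `timestampdiff` is Databricks SQL:

```sql
-- Minutes since the newest row landed in the source table
select timestampdiff(minute, max(_loaded_at), current_timestamp()) as minutes_stale
from prod.bronze.raw_transactions;
```

That number tells you whether to wait out a slow batch or page the ingestion owner.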

Step 5: Re-run efficiently

```shell
# Just the failed model
dbt run --select <model>

# Failed model + its downstream (most common)
dbt run --select <model>+

# Only models in error state from the previous run
# (copy the previous run's artifacts out of target/ first; a new run overwrites them)
dbt run --select result:error --state ./target-prev/

# State-based: modified models against prod
dbt run --select state:modified+ --defer --state ./prod-manifest/
```

Important

Always prefer <model>+ over a blanket dbt run. Rebuilding unrelated models during a recovery adds cost, hides the actual fix, and occasionally makes things worse if another unrelated model has a silent issue.

Step 6: Verify recovery

Before you declare the incident over:
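One mechanical check mirrors the run_results.json query from Step 1: after the recovery run, nothing should be left in a non-success state. A sketch; the sample file below stands in for a real target/run_results.json:

```shell
# Sample stands in for target/run_results.json after the recovery run
cat > run_results.json <<'EOF'
{"results": [{"unique_id": "model.proj.fct_orders", "status": "success"}]}
EOF

# jq -e exits 0 only if every result has status "success"
jq -e '[.results[] | select(.status != "success")] | length == 0' run_results.json \
  && echo "clean"
```

Also confirm downstream dashboards refresh and the tests that fired in Step 4c now pass.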

Triage template

Fill this out every time:

```
Date/Time:         _______
Model:             _______
Materialization:   [ ] view  [ ] table  [ ] incremental  [ ] MV
Error class:       [ ] Compile  [ ] DB  [ ] Dep  [ ] Fresh  [ ] Test  [ ] Runtime
First-line error:  _______
Blast radius:      ___ downstream models
Resolution:        [ ] Code fix  [ ] Full refresh  [ ] Permissions  [ ] Upstream  [ ] Escalate
Resolution detail: _______
Recovery command:  _______
```

Attach the completed template to the incident record. Future you (or the next on-call) will thank you.

See also