A dbt project is production-ready when a steward can step away from it and nothing catches fire in their absence. This document is the checklist that makes that true.
## 1. Project hygiene
- [ ] Three-layer structure (`staging/`, `intermediate/`, `marts/`); no fourth layer.
- [ ] All models conform to naming standards (see Model authoring).
- [ ] `dbt_project.yml` declares per-layer defaults (materialization, schema, tags).
- [ ] `packages.yml` pins versions; no floating `latest`.
- [ ] `dbt compile` passes on a fresh clone with no warnings.
- [ ] `dbt parse` runs in under 5 seconds at the current project size.
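The per-layer defaults item can be sketched in `dbt_project.yml`. This is a sketch, not a template: the project name `causeway_analytics` and the schema/tag values are hypothetical.

```yaml
# Hypothetical project name and values; adjust to your layout.
models:
  causeway_analytics:
    staging:
      +materialized: view
      +schema: staging
      +tags: ["staging"]
    intermediate:
      +materialized: ephemeral
      +tags: ["intermediate"]
    marts:
      +materialized: table
      +schema: marts
      +tags: ["marts"]
```

Per-model overrides then become visible diffs against these defaults, which is exactly what the "exceptions documented" items below rely on.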
## 2. Environment isolation
- [ ] Unity Catalog catalogs per environment: `dev`, `staging`, `prod`.
- [ ] Schemas carve domains inside each catalog.
- [ ] `generate_schema_name` macro produces per-developer schemas in dev, unprefixed schemas in prod.
- [ ] `profiles.yml` uses env vars for secrets; no literal tokens in any committed file.
```jinja
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if target.name == 'prod' -%}
        {{ custom_schema_name | default(target.schema, true) }}
    {%- else -%}
        {{ target.schema }}{% if custom_schema_name %}_{{ custom_schema_name }}{% endif %}
    {%- endif -%}
{%- endmacro %}
```
> **Danger**
> Before any prod deploy, confirm `generate_schema_name` is in place. Without it, a developer's `dbt run` can silently write to prod schemas named the same as their dev schemas. The fastest way to lose confidence in a data platform is to find out a developer's test data ended up in the executive dashboard.
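The env-var rule for `profiles.yml` can be sketched as follows. The profile name and env-var names are hypothetical; the field names are the dbt-databricks adapter's standard connection fields.

```yaml
causeway:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: dev
      schema: "dev_{{ env_var('USER') }}"
      host: "{{ env_var('DBX_HOST') }}"
      http_path: "{{ env_var('DBX_HTTP_PATH') }}"
      token: "{{ env_var('DBX_TOKEN') }}"  # never a literal token
```

`env_var` fails loudly when the variable is unset, which is the behavior you want: a missing secret should stop the run, not fall back to something shared.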
## 3. Materialization choices are justified
- [ ] Staging: all `view`. Any exceptions documented.
- [ ] Intermediate: `ephemeral` unless referenced by 3+ downstream models (then `view`).
- [ ] Marts: `table` by default; `incremental` only when a full refresh exceeds ~10 minutes or costs more than the team is willing to absorb nightly; `materialized_view` for large aggregations Databricks can maintain; `streaming_table` for latency-critical paths.
- [ ] Every `partition_by` has an isolation justification; otherwise use liquid clustering.
- [ ] Named compute defined in `profiles.yml`; heavy models pinned to a larger warehouse.
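An incremental mart justified under the ~10-minute rule might be configured like this. Model, column, and key names are hypothetical, and `liquid_clustered_by` is the dbt-databricks config name in recent adapter versions; verify against the version you run.

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    liquid_clustered_by='order_date'
) }}

select *
from {{ ref('stg_orders') }}
{% if is_incremental() %}
-- Only process rows newer than what the target table already holds
where order_date >= (select max(order_date) from {{ this }})
{% endif %}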
## 4. Testing
- [ ] Generic data tests on every staging primary key (`not_null` + `unique`).
- [ ] Generic data tests on every mart grain key.
- [ ] `relationships` tests on every foreign key in marts.
- [ ] Unit tests (dbt 1.8+) on every model containing non-trivial logic: window functions, case-when chains, regex, date math, SCD.
- [ ] Singular tests for domain-specific invariants (e.g., "no refund greater than its order").
- [ ] Model contracts enforced on every mart that BI or services consume; versioned via `versions:` for breaking changes.
- [ ] `store_failures: true` at the project level so failed rows are inspectable in `dbt_test_audit`.
Unit tests run in dev and CI. Never in prod. dbt Labs is explicit on this.
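A dbt 1.8+ unit test on a hypothetical date-math model might look like the sketch below; the model, columns, and expected rows are illustrative.

```yaml
unit_tests:
  - name: test_fct_orders_late_flag
    model: fct_orders
    given:
      - input: ref('stg_orders')
        rows:
          - {order_id: 1, shipped_at: "2024-01-10", promised_at: "2024-01-08"}
          - {order_id: 2, shipped_at: "2024-01-05", promised_at: "2024-01-08"}
    expect:
      rows:
        - {order_id: 1, is_late: true}
        - {order_id: 2, is_late: false}
```

Because the inputs are fixtures, the test runs against compiled SQL without touching warehouse data, which is why it belongs in dev and CI rather than prod.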
## 5. CI pipeline
- [ ] Every push runs `dbt parse && dbt compile` in under 30 seconds.
- [ ] Every PR runs SQLFluff with the `dbt` templater.
- [ ] Every PR runs unit tests: `dbt test --select test_type:unit`.
- [ ] Every PR runs Slim CI: `dbt build --select state:modified+ --defer --state ./prod-manifest/ --favor-state` into a PR-specific schema.
- [ ] PR schemas are torn down on merge or timeout.
- [ ] PR CI finishes within 5 minutes for an average change.
See Slim CI guide for the canonical pipeline.
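Under the hood, the Slim CI step amounts to roughly the following; the `ci` target is hypothetical (its schema should encode the PR number, via `generate_schema_name` or similar), and the manifest path matches the artifact location used in section 6.

```
# Fetch the prod manifest uploaded by the last successful deploy
aws s3 cp s3://dbt-artifacts/prod/manifest.json ./prod-manifest/manifest.json

# Build only modified nodes and their children;
# unchanged refs are deferred to the prod relations
dbt build \
  --select state:modified+ \
  --defer --state ./prod-manifest/ --favor-state \
  --target ci
```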
## 6. Prod deploy
- [ ] Single source of truth for when prod runs (Airflow DAG, dbt Cloud job, Databricks Workflow).
- [ ] Prod run uses `dbt build` (never a `dbt run` + `dbt test` split).
- [ ] `manifest.json` uploaded to object storage after every successful prod run. CI reads this file; it is not optional.
- [ ] A dated copy of `manifest.json` is kept for 30+ days for audit and rollback.
- [ ] `partial_parse.msgpack` shipped alongside the manifest for orchestrator consumers (Cosmos).
```bash
# Trailing steps of the prod deploy
dbt build --target prod
aws s3 cp target/manifest.json s3://dbt-artifacts/prod/manifest.json
aws s3 cp target/manifest.json s3://dbt-artifacts/prod/archive/$(date -u +%FT%H%MZ)/manifest.json
aws s3 cp target/partial_parse.msgpack s3://dbt-artifacts/prod/partial_parse.msgpack
```
## 7. Orchestration
- [ ] One Airflow task per dbt node via Cosmos; not `BashOperator("dbt run")`.
- [ ] `LoadMode.DBT_MANIFEST` in production; never `DBT_LS`.
- [ ] `TestBehavior.AFTER_EACH` so test failures block downstream tasks.
- [ ] `emit_datasets=True` so downstream DAGs can trigger on model updates.
- [ ] dbt in its own virtualenv, separate from Airflow's package set.
- [ ] `partial_parse=True` and `partial_parse.msgpack` present to cut per-task parse cost.
See Orchestrate with Cosmos for the full configuration.
> **Warning**
> `BashOperator("dbt run")` is banned in Causeway production code. It gives you one Airflow task for an entire dbt DAG: no per-model retry, no lineage, no parallelism, no per-task log isolation. If a team cannot adopt Cosmos for compatibility reasons, get an exception on file before shipping.
## 8. Source freshness monitoring
- [ ] Every source declares `warn_after` and `error_after` freshness thresholds.
- [ ] `dbt source freshness` runs in orchestration on a schedule appropriate to each source's cadence.
- [ ] Freshness failures alert the data team, not the dbt CI pipeline.
- [ ] Source freshness results uploaded as artifacts (`target/sources.json`) for tools like Elementary.
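The thresholds item can be sketched in a sources yml; the source, table, and column names here are hypothetical, and the counts should track each source's actual load cadence.

```yaml
sources:
  - name: payments
    loaded_at_field: _ingested_at
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: transactions
      - name: refunds
        freshness:
          # Slower upstream cadence, looser threshold
          warn_after: {count: 1, period: day}
```

Table-level `freshness` overrides the source-level default, so one block covers the common case and per-table exceptions stay visible.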
## 9. Observability
- [ ] `manifest.json` + `run_results.json` + `catalog.json` uploaded after every run.
- [ ] A dashboard (Elementary, custom, or OTel-based) shows model run success/failure over time.
- [ ] Model execution time is tracked per node; regressions flag in review.
- [ ] On-call rotation knows where to find the last-run logs without asking.
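The "which models errored last run" question (sections 9 and 11) can be answered straight from `run_results.json`. A minimal Python sketch; the inline fixture is hypothetical but shaped like the real artifact's `results` array.

```python
def failed_nodes(run_results: dict) -> list[str]:
    """Return unique_ids of nodes whose status was error or fail."""
    bad = {"error", "fail"}
    return [
        r["unique_id"]
        for r in run_results.get("results", [])
        if r.get("status") in bad
    ]

# In real use: run_results = json.load(open("target/run_results.json"))
# Inline fixture for illustration only:
sample = {
    "results": [
        {"unique_id": "model.proj.stg_orders", "status": "success"},
        {"unique_id": "model.proj.fct_orders", "status": "error"},
        {"unique_id": "test.proj.unique_fct_orders_order_id", "status": "fail"},
    ]
}
print(failed_nodes(sample))
# → ['model.proj.fct_orders', 'test.proj.unique_fct_orders_order_id']
```

The resulting ids feed directly into `dbt run --select`; on dbt 1.6+, `dbt retry` does this bookkeeping for you from the previous run's artifacts.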
## 10. Documentation and contracts
- [ ] Every public mart has a `description` field in its yml.
- [ ] Every public mart's columns have `description` fields.
- [ ] Every public mart has an enforced `contract:`.
- [ ] `dbt docs generate` runs in CI; the output is published somewhere internal consumers can browse.
> **Note**
> "Documented" means in the yml file, not in a Notion page. dbt's docs regenerate on every run; Notion pages do not. When yml and Notion disagree, yml wins. The fastest way to keep docs honest is to make them part of the artifact pipeline.
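Sections 4 and 10 meet in the mart's yml. A minimal sketch with hypothetical model and column names; note that an enforced contract requires a `data_type` on every column.

```yaml
models:
  - name: fct_orders
    description: One row per order; grain is order_id.
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        description: Primary key.
        data_type: bigint
        constraints:
          - type: not_null
        data_tests:
          - unique
          - not_null
      - name: order_total
        description: Order value in USD, net of discounts.
        data_type: decimal(18, 2)
```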
## 11. Backfills and recovery
- [ ] Project has a documented runbook for `--full-refresh`ing a specific model.
- [ ] Project has a documented runbook for backfilling a date range on an incremental model.
- [ ] On-call knows how to identify which models errored last run and how to rerun just those.
- [ ] For high-value marts, backfill windows are tested at least once per quarter.
See Failure triage for the 5-minute incident procedure.
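A runbook for this section usually reduces to a handful of commands. The model name and the `backfill_*` vars below are hypothetical; the vars approach only works if the model's incremental filter actually reads them.

```
# Full refresh of one model and everything downstream of it
dbt build --select fct_orders+ --full-refresh --target prod

# Rerun only the nodes that errored or were skipped last run (dbt 1.6+)
dbt retry --target prod

# Backfill a date range on an incremental model, assuming the model's SQL
# reads var('backfill_start') / var('backfill_end') in its incremental filter
dbt run --select fct_orders --target prod \
  --vars '{backfill_start: "2024-01-01", backfill_end: "2024-01-31"}'
```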
## 12. Security
- [ ] No secrets in committed files. All sensitive values come from env vars.
- [ ] The service principal used for prod runs has scope-limited grants: `USE CATALOG` + `USE SCHEMA` + `SELECT` on sources; `CREATE TABLE` on target schemas; nothing broader.
- [ ] Dev tokens rotated on a schedule (90 days typical).
- [ ] PII-bearing models declare classification in Causeway's contract system (see the contract triple) and are materialized into Restricted-tier schemas.
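The grant set above translates to Unity Catalog SQL roughly as follows; the principal and schema names are hypothetical, and schema-level `SELECT` inherits to all current and future tables in that schema.

```sql
-- Read side: source schemas only
GRANT USE CATALOG ON CATALOG prod TO `sp-dbt-prod`;
GRANT USE SCHEMA ON SCHEMA prod.raw_payments TO `sp-dbt-prod`;
GRANT SELECT ON SCHEMA prod.raw_payments TO `sp-dbt-prod`;

-- Write side: target schemas only
GRANT USE SCHEMA ON SCHEMA prod.marts TO `sp-dbt-prod`;
GRANT CREATE TABLE ON SCHEMA prod.marts TO `sp-dbt-prod`;
```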
## 13. Scale readiness
When the project crosses these thresholds, extra attention is warranted:
| Model count | Action |
|---|---|
| 100 | Partial parse must be on (default since 1.4). Verify. |
| 300 | `--empty` CI path for sub-minute PR feedback. |
| 500 | Cosmos `TestBehavior.BUILD` or per-model task groups to keep DAG parse tractable. |
| 800 | Dedicated manifest archival, retention policy, and a documented recovery procedure. |
| 1500 | Split into multiple dbt projects, glued by exposures and manifest handoffs (see dbt-loom). |
## 14. The promote-to-prod gate
Before a dbt project (or a new mart) is considered production-ready, a reviewer confirms each section above with a concrete artifact: a PR review note, a link to the CI run, a screenshot of the object-storage manifest, etc. The gate is the checklist; there is no other ceremony.
Deviations require an RFD and a dated waiver in the project's README. Waivers expire.
## See also
- Model authoring standards — the rules that apply per model.
- Slim CI guide — the pipeline that makes section 5 cheap.
- Failure triage — the procedure that makes section 11 survivable.
- The contract triple — the governance layer above model contracts.