A Databricks workload is production-ready when the on-call engineer can step away from it and nothing catches fire in their absence. This checklist is the threshold.
1. Workspace isolation
- [ ] Three separate Databricks workspaces: `dev`, `staging`, `prod`.
- [ ] Unity Catalog catalog-per-environment: `dev`, `staging`, `prod`.
- [ ] Domain schemas inside each catalog; no cross-catalog ad-hoc references.
- [ ] Service principals own prod resources; no human identity can deploy or run prod jobs.
- [ ] Cluster policies applied per workspace; no team has opted out without a waiver.
**Danger:** Before promoting any workload to prod, verify the three-workspace isolation. Shared workspaces are the single most common source of "prod outage caused by a dev experiment" post-mortems. One bad notebook run in a shared workspace is enough to drop a prod table.
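The "no cross-catalog ad-hoc references" rule can be linted mechanically before promotion. A minimal sketch, assuming catalogs are named after their environments as above (the `cross_catalog_refs` helper and its mapping are illustrative, not a Databricks API):

```python
# Hypothetical lint helper: flag three-part table references that escape
# the catalog assigned to the current environment.
ENV_CATALOGS = {"dev": "dev", "staging": "staging", "prod": "prod"}

def cross_catalog_refs(env: str, table_refs: list[str]) -> list[str]:
    """Return fully qualified references whose catalog != the env's catalog."""
    allowed = ENV_CATALOGS[env]
    bad = []
    for ref in table_refs:
        parts = ref.split(".")
        # Two-part references resolve inside the current catalog, so only
        # explicit catalog.schema.table references can violate the rule.
        if len(parts) == 3 and parts[0] != allowed:
            bad.append(ref)
    return bad

print(cross_catalog_refs("dev", ["prod.sales.orders", "dev.sales.orders", "sales.orders"]))
# ['prod.sales.orders']
```

Run it over the table references your SQL parser extracts from each PR; a non-empty result fails the check.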
2. Asset Bundle hygiene
- [ ] Every deployable resource (jobs, pipelines, warehouses, dashboards) defined in a DAB.
- [ ] `databricks.yml` declares `variables:` for per-env values; no hard-coded hosts, warehouse IDs, or catalogs in resource YAML.
- [ ] Dev target: `mode: development`. Staging + prod: `mode: production` with `run_as.service_principal_name`.
- [ ] `databricks bundle validate` runs in CI before any deploy step.
- [ ] Prod deploys tagged (git tag + DAB resource tag) so rollback is one command.
See Asset Bundles guide for the canonical layout.
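The "no hard-coded hosts" item can be enforced in the same CI step that runs `databricks bundle validate`. A hedged sketch that walks a parsed resource mapping and flags workspace-host literals that were not injected via `${var.*}` (the host patterns and function name are assumptions for illustration):

```python
import re

# Hypothetical CI check: workspace-host literals in resource YAML should come
# from ${var.*} substitutions, never be pasted in directly.
HARDCODED = re.compile(
    r"(adb-\d+\.\d+\.azuredatabricks\.net|https://[\w.-]+\.cloud\.databricks\.com)"
)

def hardcoded_values(resource: dict) -> list[str]:
    """Recursively scan a parsed bundle mapping for hard-coded hosts.
    Only walks nested dicts; extend for lists if your YAML needs it."""
    bad = []
    for key, value in resource.items():
        if isinstance(value, dict):
            bad.extend(hardcoded_values(value))
        elif isinstance(value, str) and HARDCODED.search(value) and "${var." not in value:
            bad.append(f"{key}: {value}")
    return bad

res = {"jobs": {"nightly": {"host": "https://acme.cloud.databricks.com",
                            "catalog": "${var.catalog}"}}}
print(hardcoded_values(res))  # ['host: https://acme.cloud.databricks.com']
```

Feed it the output of `yaml.safe_load` over each resource file and fail the build on a non-empty list.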
3. Compute
- [ ] SQL warehouses: Serverless, one per workload class (`wh-bi`, `wh-elt`, `wh-adhoc`).
- [ ] Scheduled jobs: Serverless job compute; job clusters only when Serverless cannot support the workload.
- [ ] No production job attached to all-purpose compute.
- [ ] Photon on by default; disabled only for justified sub-2-second workloads.
- [ ] Every compute resource tagged (`environment`, `owner`, `cost_center`, `project`).
- [ ] Cluster policies enforce allowed instance types, max worker counts, mandatory autotermination.
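The mandatory-tag rule reduces to a set difference once a resource's custom tags are in hand. A minimal pure-Python sketch (fetching the tags from the API is out of scope here):

```python
# The four tags this checklist mandates on every compute resource.
REQUIRED_TAGS = {"environment", "owner", "cost_center", "project"}

def missing_tags(custom_tags: dict[str, str]) -> set[str]:
    """Mandatory attribution tags the compute resource has not set."""
    return REQUIRED_TAGS - custom_tags.keys()

print(missing_tags({"environment": "prod", "owner": "data-eng"}))
# {'cost_center', 'project'}  (set ordering varies)
```

An empty result means the resource passes; anything else blocks the deploy or opens a ticket against the owner.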
4. Unity Catalog
- [ ] Managed tables by default; external tables only for justified reasons.
- [ ] Every mart (`gold` schema) has a `description` and column-level `comment`s.
- [ ] Tables carry governance tags: `classification`, `pii`, `domain`, `tier`.
- [ ] Grants scoped to the least-privilege principle; no `ALL PRIVILEGES` grants to individual users in prod.
- [ ] Storage credentials use IAM roles, not access keys.
- [ ] External locations validated (`VALIDATE EXTERNAL LOCATION`) on creation.
See Unity Catalog concepts for the hierarchy.
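The least-privilege item can be spot-checked by scanning grant listings. A sketch, with the caveat that treating an email-style principal as an individual user is a heuristic assumption, not a Unity Catalog guarantee:

```python
def risky_grants(grants: list[dict], env: str = "prod") -> list[dict]:
    """Flag ALL PRIVILEGES granted to individual users in prod.
    Heuristic: a principal containing '@' is treated as a human user;
    groups and service principals pass."""
    if env != "prod":
        return []
    return [
        g for g in grants
        if "ALL PRIVILEGES" in g.get("privileges", [])
        and "@" in g.get("principal", "")
    ]

grants = [
    {"principal": "alice@acme.com", "privileges": ["ALL PRIVILEGES"]},
    {"principal": "data-eng-sp", "privileges": ["ALL PRIVILEGES"]},
    {"principal": "bob@acme.com", "privileges": ["SELECT"]},
]
print(risky_grants(grants))
# [{'principal': 'alice@acme.com', 'privileges': ['ALL PRIVILEGES']}]
```

The input shape mirrors what a `SHOW GRANTS` result could be normalized into; adapt the field names to your export.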
5. Code organization
- [ ] Transformation logic lives in `.py` packages; notebooks are thin bootstrappers.
- [ ] Unit tests pass in plain CI (without a Databricks workspace).
- [ ] Wheel built in CI, attached to jobs via DAB.
- [ ] No business rule encoded only in a notebook.
- [ ] Jinja and SQL code passes `sqlfluff lint` in CI.
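The thin-bootstrapper pattern keeps the logic importable and testable without a workspace. A toy example of the split (module and function names are hypothetical):

```python
# transformations/orders.py — the logic lives in the package, not the notebook.
def add_net_amount(rows: list[dict], tax_rate: float = 0.2) -> list[dict]:
    """Pure transformation: runs in plain CI, no Spark session needed."""
    return [{**r, "net": round(r["gross"] / (1 + tax_rate), 2)} for r in rows]

# The notebook then reduces to a thin bootstrapper, roughly:
#   from transformations.orders import add_net_amount
#   df = spark.createDataFrame(add_net_amount(rows))
print(add_net_amount([{"gross": 120.0}]))  # [{'gross': 120.0, 'net': 100.0}]
```

Because the function takes and returns plain Python structures, the unit test is a one-line assert with no Databricks dependency.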
6. Testing
- [ ] Unit tests (Python) for every module containing non-trivial logic.
- [ ] Chispa / pytest for Spark transformations.
- [ ] LDP expectations on silver tables; `expect_or_drop` or `expect_or_fail` per data invariant.
- [ ] dbt model contracts on every gold mart; see dbt authoring standards.
- [ ] Integration tests run on staging before prod deploy.
- [ ] Smoke tests run after each prod deploy.
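The drop semantics worth unit-testing here are simple enough to state in pure Python. This is a sketch of `expect_or_drop` behavior for tests that run in plain CI, not the `dlt` decorator API itself:

```python
def expect_or_drop(rows: list[dict], name: str, predicate) -> tuple[list[dict], dict]:
    """Pure-Python sketch of LDP drop semantics: rows failing the invariant
    are removed, and the violation count is reported the way the pipeline
    event log would surface it."""
    kept = [r for r in rows if predicate(r)]
    return kept, {name: len(rows) - len(kept)}

rows = [{"id": 1}, {"id": None}, {"id": 2}]
clean, report = expect_or_drop(rows, "id_not_null", lambda r: r["id"] is not None)
print(report)  # {'id_not_null': 1}
```

Keeping the predicate as a named function in the package lets the same invariant be asserted in CI and declared in the pipeline.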
7. Authentication and secrets
- [ ] CI authenticates via workload identity federation (OIDC). No PATs in CI secrets.
- [ ] Service principal credentials rotated on the platform's cadence (automatic for OIDC).
- [ ] All secrets in Databricks Secret Scopes or cloud KMS. No YAML secrets.
- [ ] Notebooks reference secrets via `dbutils.secrets.get`; bundles via `${secrets.<scope>.<key>}`.
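For unit tests that must pass in plain CI (section 5), one common trick is a resolver that uses `dbutils.secrets.get` when a `dbutils` handle is available and falls back to environment variables locally. The env-var naming convention below is an assumption, not a standard:

```python
import os

def get_secret(scope: str, key: str, dbutils=None) -> str:
    """On Databricks, read from the secret scope; in plain CI, fall back to
    an environment variable (SCOPE_KEY, upper-cased, hyphens to underscores —
    a local convention, choose your own)."""
    if dbutils is not None:
        return dbutils.secrets.get(scope=scope, key=key)
    name = f"{scope}_{key}".upper().replace("-", "_")
    return os.environ[name]
```

The notebook passes its real `dbutils`; tests pass nothing and export `KV_DB_PASSWORD`-style variables instead.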
8. Observability
- [ ] Job runs surfaced in the Lakeflow Jobs UI; notifications on failure.
- [ ] LDP pipeline event logs queried into a dashboard (Elementary, Grafana, or Causeway's internal view).
- [ ] SQL warehouse query history reviewed for drift: p95 query duration, failed query rate, queued-query peaks.
- [ ] Cluster event logs and driver logs retained for a minimum of 30 days.
- [ ] DBU consumption monitored per team via tag attribution.
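The warehouse-drift numbers above reduce to small aggregates over query-history rows. A self-contained sketch using nearest-rank p95 (the input values are illustrative, not the system-table schema):

```python
import math

def p95(durations_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of query durations."""
    ranked = sorted(durations_ms)
    idx = math.ceil(0.95 * len(ranked)) - 1
    return ranked[idx]

def failed_rate(statuses: list[str]) -> float:
    """Share of queries that ended in a FAILED state."""
    return statuses.count("FAILED") / len(statuses)

print(p95(list(range(1, 101))))  # 95
print(failed_rate(["FINISHED", "FAILED", "FINISHED", "FINISHED"]))  # 0.25
```

Track both per warehouse per day; drift is the week-over-week delta, not the absolute value.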
9. Orchestration
- [ ] Jobs defined in DABs; Airflow (if used) calls `DatabricksRunNowOperator` by job ID.
- [ ] `DatabricksSubmitRunOperator` banned in production; a second source of truth for job config is a waiver-only exception.
- [ ] Lakeflow Jobs for single-platform workloads; Airflow only when the DAG crosses platforms.
- [ ] Cosmos with `LoadMode.DBT_MANIFEST` when dbt appears inside Airflow.
- [ ] Retries configured per task; retry count bounded; alerts on retry-exhausted failure.
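The bounded-retry-plus-alert policy in the last item fits in a few lines. This is a sketch of the policy itself, not an Airflow or Lakeflow API:

```python
def run_with_bounded_retries(task, max_retries: int = 3, alert=print):
    """Run a task with at most max_retries retries; fire exactly one alert
    when retries are exhausted, then re-raise the last error."""
    last = None
    for _attempt in range(max_retries + 1):
        try:
            return task()
        except Exception as exc:  # noqa: BLE001 — policy sketch, not prod code
            last = exc
    alert(f"retry-exhausted after {max_retries} retries: {last}")
    raise last
```

In real orchestrators the same shape is declarative (`retries` and an on-failure notification on the task); the point is that both the bound and the exhaustion alert must exist.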
10. Governance
- [ ] Tables classified per Causeway RAG tier (`restricted`, `internal`, `public`).
- [ ] PII-bearing tables in Restricted-tier schemas with masking policies applied.
- [ ] Audit logs queried weekly for unexpected access patterns.
- [ ] Contracts enforced on public marts per the contract triple.
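A masking policy's intent can be illustrated without UC syntax: redact restricted values, pass everything else through. The last-four-characters shape below is an assumption for illustration, not a prescribed policy:

```python
def mask_column(value: str, classification: str) -> str:
    """Masking sketch: restricted values are redacted except the last four
    characters; internal and public values pass through unchanged."""
    if classification == "restricted":
        return "*" * max(len(value) - 4, 0) + value[-4:]
    return value

print(mask_column("4111111111111111", "restricted"))  # '************1111'
```

In Unity Catalog the equivalent lives in a column mask attached to the table, keyed on the caller's group membership rather than a string argument.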
11. Recovery
- [ ] Runbook for rolling back a bundle deploy (git checkout → validate → deploy).
- [ ] Runbook for full-refreshing an LDP pipeline or dbt incremental model.
- [ ] Runbook for a Lakebase point-in-time restore.
- [ ] Backfill procedure tested at least once per quarter for high-value workloads.
- [ ] On-call knows where to find the last-run logs, event logs, and query history without asking.
See cluster troubleshooting and common errors for the triage procedures.
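A backfill that is tested quarterly should be resumable, which usually means splitting the range into bounded windows so a failed run restarts at the last completed window. A minimal sketch:

```python
from datetime import date, timedelta

def backfill_windows(start: date, end: date, days: int = 1) -> list[tuple[date, date]]:
    """Split [start, end) into half-open windows of at most `days` days,
    so each window is an independently rerunnable unit."""
    windows = []
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        windows.append((cur, nxt))
        cur = nxt
    return windows

print(backfill_windows(date(2024, 1, 1), date(2024, 1, 4)))
```

Record each completed window (a checkpoint table or job parameter log) so the runbook's restart step is "rerun from the first missing window".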
12. Cost controls
- [ ] Every compute resource tagged for attribution.
- [ ] DBU budget alerts configured per workspace.
- [ ] Lakebase instances scale to zero when idle.
- [ ] SQL warehouses have aggressive `auto_stop_mins` (5 min Serverless, 10 min Pro).
- [ ] LDP pipelines use triggered mode unless streaming semantics justify continuous.
- [ ] No all-purpose clusters left running overnight for scheduled work.
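Budget alerting per team is a comparison of tag-attributed spend against budgets. A sketch (the spend and budget feeds are assumed inputs, e.g. from billing-usage exports grouped by the `cost_center` tag):

```python
def over_budget(spend_by_team: dict[str, float],
                budgets: dict[str, float]) -> dict[str, float]:
    """Teams whose DBU spend exceeds their budget, with the overage amount.
    Teams without a configured budget are skipped (flag those separately)."""
    return {
        team: spend - budgets[team]
        for team, spend in spend_by_team.items()
        if team in budgets and spend > budgets[team]
    }

print(over_budget({"data-eng": 1200.0, "bi": 300.0},
                  {"data-eng": 1000.0, "bi": 500.0}))  # {'data-eng': 200.0}
```

Run it on the same cadence as the budget alerts and route overages to the team's channel, not a shared inbox.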
13. Documentation
- [ ] Every job's purpose, schedule, and owner in a README or dbt / LDP description.
- [ ] Every prod pipeline has a one-paragraph architecture note: upstream sources, downstream consumers, grain, freshness.
- [ ] Runbooks referenced in the on-call playbook.
- [ ] Contract/interface changes to public marts documented in an RFD before shipping.
**Note:** "Documented" means committed to the repository, not linked to Notion. Notion pages drift; the repository is the source of truth. `dbt docs generate` and LDP lineage make this largely automatic; your job is to keep the YAML descriptions current.
14. The promote-to-prod gate
Before a workload is considered production-ready, a reviewer confirms each section above with a concrete artifact: a PR review note, a link to the CI run, a screenshot of the workspace audit trail, a runbook link. The gate is the checklist.
Deviations require an RFD and a dated waiver in the project's README. Waivers expire on a fixed cadence (typical: 90 days).
See also
- Asset authoring standards — the per-resource rules these readiness items rest on.
- Cluster troubleshooting — the procedure on-call runs when section 11 is tested live.
- Unity Catalog — the governance plane most of this checklist lives on.