A Databricks workload is production-ready when the on-call engineer can step away from it and nothing catches fire in their absence. This checklist is the threshold.

1. Workspace isolation

Danger

Before promoting any workload to prod, verify that the three-workspace isolation is in place. Shared workspaces are the single most common source of "prod outage caused by a dev experiment" post-mortems: one bad notebook run in a shared workspace is enough to drop a prod table.
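The isolation check can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: the function name, the target names (dev, staging, prod), and the example.com hosts are all hypothetical, and it assumes you can list one workspace URL per deployment target.

```python
from urllib.parse import urlparse


def check_workspace_isolation(targets: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means no two targets share a workspace.

    `targets` maps a deployment target name to its workspace URL.
    Both the mapping and the target names are assumptions for this sketch.
    """
    problems = []
    seen = {}  # workspace hostname -> first target that claimed it
    for target, host in targets.items():
        netloc = urlparse(host).netloc.lower()
        if not netloc:
            # A bare hostname without a scheme parses to an empty netloc.
            problems.append(f"{target}: host {host!r} is not a full URL")
            continue
        if netloc in seen:
            problems.append(f"{target} shares workspace {netloc} with {seen[netloc]}")
        else:
            seen[netloc] = target
    return problems


if __name__ == "__main__":
    # Hypothetical hosts; substitute your own workspace URLs.
    issues = check_workspace_isolation({
        "dev": "https://dev.example.com",
        "staging": "https://staging.example.com",
        "prod": "https://prod.example.com",
    })
    print(issues or "each target has its own workspace")
```

A check like this only proves that the hostnames differ. It says nothing about shared credentials or cross-workspace grants, which still need a manual review.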

2. Asset Bundle hygiene

See the Asset Bundles guide for the canonical layout.

3. Compute

4. Unity Catalog

See Unity Catalog concepts for the hierarchy.

5. Code organization

6. Testing

7. Authentication and secrets

8. Observability

9. Orchestration

10. Governance

11. Recovery

See cluster troubleshooting and common errors for the triage procedures.

12. Cost controls

13. Documentation

Note

"Documented" means committed to the repository, not a link to a Notion page. Notion pages drift; the repository is the source of truth. dbt's docs generate command and Databricks' LDP lineage make this largely automatic; your job is to keep the YAML descriptions current.

14. The promote-to-prod gate

Before a workload is considered production-ready, a reviewer confirms each section above with a concrete artifact: a PR review note, a link to the CI run, a screenshot of the workspace audit trail, a runbook link. The gate is the checklist.
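One way to make "the gate is the checklist" mechanical is a small script that refuses to pass until every section has an artifact attached. The sketch below is illustrative only: the section names are copied from the headings above, while the function name and the shape of the artifacts mapping are assumptions.

```python
# Section names taken from the checklist headings above.
SECTIONS = [
    "Workspace isolation",
    "Asset Bundle hygiene",
    "Compute",
    "Unity Catalog",
    "Code organization",
    "Testing",
    "Authentication and secrets",
    "Observability",
    "Orchestration",
    "Governance",
    "Recovery",
    "Cost controls",
    "Documentation",
]


def missing_artifacts(artifacts: dict[str, str]) -> list[str]:
    """Return the sections that have no artifact (or only whitespace) recorded.

    `artifacts` maps a section name to a reference such as a PR review
    note, a CI run link, or a runbook link. The mapping is hypothetical.
    """
    return [s for s in SECTIONS if not artifacts.get(s, "").strip()]


if __name__ == "__main__":
    recorded = {s: "link-to-artifact" for s in SECTIONS}
    recorded["Testing"] = ""  # reviewer has not attached the CI run yet
    print(missing_artifacts(recorded))
```

The script only checks that something is recorded for each section; judging whether the artifact is actually sufficient stays with the reviewer.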

Deviations require an RFD and a dated waiver in the project's README. Waivers expire on a fixed cadence (typical: 90 days).
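Waiver expiry is simple date arithmetic, so it is easy to check in CI. The sketch below assumes the waiver's grant date has already been parsed out of the README (the parsing is not shown), and it treats a waiver as expired on the day the cadence elapses. Both the function name and that boundary choice are assumptions.

```python
from datetime import date, timedelta


def waiver_expired(granted: date, today: date, cadence_days: int = 90) -> bool:
    """Return True once `cadence_days` have elapsed since the waiver was granted.

    The 90-day default mirrors the "typical" cadence mentioned above;
    treating the final day itself as expired is an assumption of this sketch.
    """
    return today >= granted + timedelta(days=cadence_days)


if __name__ == "__main__":
    granted = date(2024, 1, 1)
    print(waiver_expired(granted, date(2024, 3, 30)))  # False: day 89
    print(waiver_expired(granted, date(2024, 3, 31)))  # True: day 90
```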

See also