These standards bind every job, pipeline, notebook, and bundle that deploys to a Causeway Databricks workspace. They are not recommendations.
1. Workspace and environment isolation
- Three separate workspaces: `dev`, `staging`, `prod`. One Databricks workspace per environment, never shared.
- One Unity Catalog catalog per environment: `dev`, `staging`, `prod`. Schemas carve domains inside a catalog.
- Service principals own prod, not humans. `mode: production` bundles require `run_as.service_principal_name`.
- Human identity deploys to dev only. Dev targets use `mode: development` so twenty engineers share the workspace without collision.
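A minimal `databricks.yml` sketch of this target layout. The bundle name, workspace hosts, and service principal name are placeholders, not Causeway values:

```yaml
bundle:
  name: analytics            # placeholder project name

targets:
  dev:
    mode: development        # prefixes resources with [dev <user>]
    default: true
    workspace:
      host: https://dev-workspace.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://prod-workspace.cloud.databricks.com
    run_as:
      service_principal_name: sp-analytics-prod   # placeholder SP
```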
Danger
Never share a workspace across dev/staging/prod. The blast radius of a bad deploy is unbounded: one bad notebook can drop prod tables, one cluster policy mistake can burn the prod monthly budget. Three workspaces are the cheapest insurance you will ever buy.
2. Notebooks vs. Python packages
Logic lives in .py packages. Notebooks are thin bootstrappers.
```
my_project/
  src/mypkg/
    transforms.py        # pure functions; pytest-friendly; no Spark session coupling
    io.py                # readers, writers, side effects
  notebooks/
    run_daily.py         # a dozen lines that import mypkg and call transforms
  tests/
    test_transforms.py   # runs in plain CI via chispa or pytest
  databricks.yml
```
Rules:
- Transformation logic is unit-testable Python. No business rule lives only in a notebook.
- Ship `mypkg` as a wheel. Attach the wheel to the job; let the notebook bootstrap.
- Notebooks do not ship to production in 2026. If a notebook is scheduled, its logic belongs in a wheel.
- Unit tests run in plain CI, on a runner without a Databricks workspace. Chispa and pytest handle the Spark parts.
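A sketch of the pattern, with hypothetical names (`daily_revenue` is not a real Causeway function): the business rule is a pure function, so pytest can exercise it on a plain CI runner with no Spark session or workspace in sight.

```python
# src/mypkg/transforms.py (illustrative) -- pure logic, no Spark coupling
def daily_revenue(transactions):
    """Sum transaction amounts per day. Plain Python in, plain Python out."""
    totals = {}
    for tx in transactions:
        totals[tx["date"]] = totals.get(tx["date"], 0.0) + tx["amount"]
    return totals


# tests/test_transforms.py (illustrative) -- runs under plain pytest
def test_daily_revenue():
    txs = [
        {"date": "2026-01-01", "amount": 10.0},
        {"date": "2026-01-01", "amount": 5.0},
        {"date": "2026-01-02", "amount": 7.5},
    ]
    assert daily_revenue(txs) == {"2026-01-01": 15.0, "2026-01-02": 7.5}
```

The scheduled notebook then reduces to an import and a call; the same function runs unchanged against rows collected from a DataFrame.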
Warning
Two notebook pitfalls catch every team. First, `git pull` inside a Databricks Git folder destroys cell output and in-memory state; build with the assumption that the floor can be yanked. Second, notebook source formats (`.py`, `.sql`, `.scala`) strip outputs on commit; do not treat a notebook like a document.
3. Asset Bundles are the deployment unit
Every job, pipeline, warehouse, and dashboard deploys via a bundle. Exceptions require a written waiver.
Standards for every bundle:
- `databricks.yml` defines bundle + variables + targets. One target per environment.
- `resources/` holds one YAML per resource: `jobs/`, `pipelines/`, `warehouses/`.
- `src/` holds Python packages. `tests/` holds pytest tests that run in plain CI.
- `databricks bundle validate` passes on every PR, before anything else runs.
- Variables carry per-env values; no hard-coded workspace hosts, warehouse IDs, or catalog names in resource YAML.
- Dev target uses `mode: development`; staging and prod use `mode: production` with `run_as.service_principal_name`.
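A sketch of one such resource file, with variables instead of hard-coded per-env values. The job name and variable names here are illustrative, not prescribed:

```yaml
# resources/jobs/transform_daily.yml -- illustrative resource YAML
resources:
  jobs:
    transform_daily:
      name: transform_daily
      tasks:
        - task_key: transform
          notebook_task:
            notebook_path: ../src/notebooks/run_daily.py
      tags:
        environment: ${bundle.target}     # resolved per target
        cost_center: ${var.cost_center}   # declared in databricks.yml
```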
4. Naming
| Resource | Pattern | Example |
|---|---|---|
| Catalog | Environment | prod, staging, dev |
| Schema | Domain or data layer | bronze, silver, gold, finance |
| Table | snake_case, descriptive | customer_transactions, daily_revenue |
| Column | snake_case | customer_id, created_at |
| Volume | Purpose-first | raw_files, model_artifacts |
| Job | <purpose>_<scope> | transform_daily, ingest_hourly |
| Pipeline | <domain>_<layer> | customers_silver, events_bronze |
| Warehouse | wh-<workload-class> | wh-bi, wh-elt, wh-adhoc |
| Cluster policy | <team>-<purpose> | de-standard, ds-gpu |
| Service principal | sp-<project>-<env> | sp-analytics-prod |
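The patterns above are mechanical enough to lint. An illustrative pre-review check (not an official tool; the pattern set is an assumption drawn from the table):

```python
import re

# Illustrative name patterns derived from the naming table above.
PATTERNS = {
    "table": re.compile(r"^[a-z][a-z0-9_]*$"),                      # snake_case
    "warehouse": re.compile(r"^wh-[a-z][a-z0-9-]*$"),               # wh-<workload-class>
    "service_principal": re.compile(r"^sp-[a-z0-9-]+-(dev|staging|prod)$"),
}


def valid_name(kind: str, name: str) -> bool:
    """Return True if `name` matches the convention for `kind`."""
    return bool(PATTERNS[kind].fullmatch(name))
```

Run against resource YAML in CI and the naming section becomes enforceable rather than advisory.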
5. Tags are mandatory
Every deployable resource carries tags for cost attribution and governance:
```yaml
tags:
  environment: ${bundle.target}
  owner: data-engineering
  cost_center: DE-001
  project: customer-360
```
The four required tags:
- `environment`: matches the bundle target (`dev`, `staging`, `prod`).
- `owner`: the owning team.
- `cost_center`: the billing cost center.
- `project`: the product or initiative.
Tables additionally carry governance tags:
- `classification`: `public`, `internal`, or `confidential`.
- `pii`: `true` or `false`.
- `domain`: the business domain.
- `tier`: `bronze`, `silver`, or `gold`.
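Table tags are applied with Unity Catalog's `SET TAGS` DDL; the table name here is illustrative:

```sql
ALTER TABLE prod.gold.customer_transactions
  SET TAGS ('classification' = 'internal',
            'pii'            = 'true',
            'domain'         = 'finance',
            'tier'           = 'gold');
```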
6. Compute
See compute types for the framework. Standards summarized:
- SQL warehouses are Serverless. Pro only when Serverless is unavailable in the region. Classic is banned for new work.
- Job compute is Serverless. Job clusters only when Serverless cannot support the workload.
- All-purpose compute is for notebooks only. No scheduled jobs on all-purpose compute; this is the single largest Databricks cost leak.
- Photon is on by default. Disable only for jobs with sub-2-second queries where the startup tax hurts.
- Every resource tagged for cost attribution.
7. Authentication
- CI authenticates via workload identity federation (OIDC). GitHub Actions / Azure DevOps to a Databricks service principal. No long-lived PATs in CI secrets.
- Human developers authenticate via `databricks auth login`, which uses OAuth against the workspace.
- Service principals scope narrowly. A prod service principal gets `USE CATALOG prod`, `CREATE JOB`, and `MODIFY` on target schemas, and nothing broader.
Important
PAT tokens in CI secrets are banned in Causeway projects. OIDC federation takes an afternoon to set up per workspace; it saves every rotation-review thereafter. If your project still uses PATs, migrate before your next security review.
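A sketch of what the federated setup looks like in a GitHub Actions workflow, assuming the service principal has already been configured for GitHub OIDC; the secret name and host are placeholders:

```yaml
# .github/workflows/deploy.yml -- sketch; IDs and host are placeholders
permissions:
  id-token: write   # lets the job mint a GitHub OIDC token

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy -t prod
        env:
          DATABRICKS_HOST: https://prod-workspace.cloud.databricks.com
          DATABRICKS_AUTH_TYPE: github-oidc
          DATABRICKS_CLIENT_ID: ${{ vars.SP_CLIENT_ID }}
```

No long-lived token appears anywhere: the CLI exchanges the short-lived OIDC token for Databricks credentials at run time.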
8. Secrets
- Secrets live in Databricks Secret Scopes or the cloud's KMS/Key Vault. Never in bundle YAML. Never in notebooks. Never as `vars:` in dbt.
- Reference secrets from bundles via `${secrets.<scope>.<key>}`; from notebooks via `dbutils.secrets.get(scope, key)`.
- Rotate on a schedule (90 days for human-scope secrets; service principals are the default path precisely because they avoid this).
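The 90-day rule is easy to check mechanically. An illustrative helper (an assumption, not a Databricks API) that an audit script could run against secret metadata:

```python
from datetime import date, timedelta

# Policy from the standard: human-scope secrets rotate every 90 days.
ROTATION_PERIOD = timedelta(days=90)


def needs_rotation(last_rotated: date, today: date) -> bool:
    """True if a secret's last rotation is older than the 90-day policy."""
    return today - last_rotated > ROTATION_PERIOD
```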
9. Unity Catalog layout
- Layer schemas: `bronze`, `silver`, `gold`, `sandbox`.
- Domain schemas: `finance`, `marketing`, `product`, `ops`.
- A table belongs in one schema. Cross-domain facts go in `gold`, not in each domain's schema.
- Managed tables by default. External tables only when an existing consumer or compliance requirement dictates the storage path.
- Every public table (schema `gold`) has a `description` and column-level `comments`.
10. Lakeflow usage
- LDP owns silver streaming tables with quality gates and CDC ingestion.
- dbt owns gold marts and batch SQL transformations.
- Lakeflow Jobs orchestrates Databricks-internal workloads.
- Airflow (+ Cosmos) orchestrates cross-platform DAGs; the job itself is defined in a DAB and called via `DatabricksRunNowOperator`.
See Lakeflow concepts for the decision framework.
11. Git and concurrent development
- Trunk-based development. Short-lived feature branches; merge to main in hours, not days.
- Git Folders for notebook-heavy work; external Git + IDE + Databricks Connect for pure-code projects.
- Per-developer isolation:
  - DAB `mode: development` prepends `[dev ${user}]` to resource names.
  - Unity Catalog schema-per-user for dbt (`dbt_<user>`) and ad-hoc output.
  - Lakebase branching for ephemeral test OLTP.
- Feature flags via DAB variables or target selectors, not long-running branches.
12. CI/CD pipeline
Every PR runs:
```
databricks bundle validate          # catches YAML / permission issues
pytest src/ tests/                  # pure-Python unit tests
dbt parse && dbt compile            # if the project has dbt
dbt test --select test_type:unit
sqlfluff lint                       # SQL style
databricks bundle deploy -t dev     # deploy to ephemeral dev target
<smoke tests against dev>
```
On merge to main:
```
databricks bundle deploy -t staging
<integration tests>
databricks bundle deploy -t prod    # service-principal auth
```
Every prod deploy tags the commit. Rollback is `databricks bundle deploy` at the previous tag.
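In practice the rollback is two commands; the tag name here is illustrative:

```shell
git checkout v2026.02.14            # the tag the last good prod deploy created
databricks bundle deploy -t prod    # redeploy that exact state (CI identity)
```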
Danger
Lock prod deploys to CI-only. A team member running `bundle deploy -t prod` from a laptop can clobber production. The deploying service principal's grants live in CI's workload identity; no human identity should hold them.
13. Review checklist
PRs touching Databricks resources must satisfy:
- [ ] Bundle validates cleanly (`databricks bundle validate`).
- [ ] Python logic is unit-tested; the notebook is a bootstrapper.
- [ ] Resources named per section 4.
- [ ] Tags per section 5.
- [ ] Compute per section 6 (Serverless defaults; no prod on all-purpose).
- [ ] Secrets per section 8 (scopes, not YAML).
- [ ] UC layout per section 9 (catalog per env, schemas for domain/layer).
- [ ] LDP / dbt / Lakeflow Jobs / Airflow chosen per section 10.
- [ ] PR's CI deploys to dev and runs smoke tests before merge.
See also
- Production readiness — what it takes to promote a workload to prod.
- Unity Catalog — the governance plane these standards rest on.
- Asset Bundles guide — the mechanism for most of section 3.