Databricks exposes three compute tiers — SQL warehouses, jobs compute (serverless or job clusters), and all-purpose clusters. They look similar in the UI and they bill very differently. Using the wrong one is the single most expensive mistake teams make on the platform.
| Compute | Workload | Cost profile |
|---|---|---|
| SQL warehouse | BI, dbt, ad-hoc SQL | DBU/s, scales per-query |
| Serverless jobs | Scheduled Python / notebook tasks | DBU/s for task runtime only |
| Job cluster | Jobs that need specific instance types or init scripts | DBU/s for task + cluster lifetime |
| All-purpose cluster | Interactive notebook development | 2–3× jobs compute; bills while idle |
Set three rules before anything else:
- SQL is a SQL warehouse.
- A scheduled job is jobs compute.
- A notebook you are typing in is all-purpose compute.
Any violation of those three is a billing leak.
SQL warehouses
Three flavors exist: Serverless, Pro, Classic. There is essentially one right answer in 2026.
- Serverless SQL warehouses start in 2–6 seconds, autoscale per query via Databricks' Intelligent Workload Management, include Photon and Predictive IO, and receive every performance improvement first. Default for everything SQL.
- Pro is serverless-lite: about a four-minute cold start, no IWM. Use only when Serverless is not available in your region.
- Classic is legacy. Do not pick it for new work.
Sizing, counterintuitively
Start bigger than you think, then size down. Small warehouses saturate and queue; medium and large warehouses finish queries so fast they idle and cost less overall.
The metric that matters is Peak Queued Queries. If it stays above zero under normal load, the t-shirt size is not the problem — concurrency is: raise max_num_clusters before bumping the size.
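The scale-out-first rule can be sketched as a small helper. Field names (`cluster_size`, `max_num_clusters`) follow the SQL Warehouses API; the function name and the cluster cap are ours, so treat this as a sketch rather than an SDK call.

```python
# Sketch: respond to sustained Peak Queued Queries by scaling out
# (more clusters) before scaling up (bigger t-shirt size).

SIZES = ["2X-Small", "X-Small", "Small", "Medium", "Large", "X-Large"]

def scale_out_first(config: dict, peak_queued: int, cluster_cap: int = 10) -> dict:
    """Return an updated warehouse config: raise max_num_clusters while
    queries queue; only bump the t-shirt size once the cap is reached."""
    updated = dict(config)
    if peak_queued > 0:
        if updated["max_num_clusters"] < cluster_cap:
            updated["max_num_clusters"] += 1
        else:
            i = SIZES.index(updated["cluster_size"])
            updated["cluster_size"] = SIZES[min(i + 1, len(SIZES) - 1)]
    return updated

wh = {"cluster_size": "Medium", "max_num_clusters": 3}
print(scale_out_first(wh, peak_queued=4))  # max_num_clusters -> 4
```

Feeding the result back through the warehouse edit endpoint is left to your deployment tooling.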
One warehouse per workload class, not per team
- `wh-bi` — many concurrent users, generous min/max clusters, scale-to-zero
- `wh-elt` — one dbt run at a time, single cluster, large t-shirt
- `wh-adhoc` — small, aggressive auto-stop
This separation prevents the "Tableau refresh slowed down our pipeline" problem. Teams share workload-class warehouses; they do not each get their own.
Serverless compute for jobs
In 2026 this is the default for scheduled jobs. Autoscaling is on. Photon is on. Cold starts measure in single-digit seconds. You pay only while tasks run.
Use it unless one of the following is true:
- Your job needs a custom instance type (GPU, memory-optimized beyond what serverless supports).
- Your job requires an init script.
- Your job depends on libraries not yet supported by serverless.
When those apply, use a job cluster (a classic cluster dedicated to one run, terminated on completion). Job clusters remain the fallback; serverless is the first choice.
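The serverless-first rule shows up directly in the task payload: a task with no cluster spec runs on serverless jobs compute, while a task that needs a custom instance type or init script carries an explicit `new_cluster` block. The field names below follow the Jobs API 2.1 shape, but the paths and instance types are illustrative assumptions — verify against your workspace.

```python
# Sketch of two Jobs API 2.1-style task payloads (paths and instance
# types are made up for illustration).

serverless_task = {
    "task_key": "nightly_etl",
    "notebook_task": {"notebook_path": "/Repos/etl/nightly"},
    # No cluster spec: the task runs on serverless jobs compute.
}

job_cluster_task = {
    "task_key": "gpu_training",
    "notebook_task": {"notebook_path": "/Repos/ml/train"},
    # Explicit cluster spec: the fallback for custom instance types
    # or init scripts that serverless cannot support.
    "new_cluster": {
        "spark_version": "15.4.x-gpu-ml-scala2.12",
        "node_type_id": "g5.xlarge",
        "num_workers": 2,
        "init_scripts": [
            {"workspace": {"destination": "/init/install_drivers.sh"}}
        ],
    },
}

assert "new_cluster" not in serverless_task
```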
All-purpose clusters
All-purpose clusters are for humans at notebooks. They stay alive between invocations, which is what makes interactive development feel fast. They are also priced at 2–3× job compute, which is what makes them a cost leak when anything automated attaches to one.
Danger
Never attach a production job to an all-purpose cluster. It is the single most common cost overrun surfaced in Databricks billing reviews. If a job is scheduled, it runs on serverless jobs compute or a job cluster. No exceptions without a documented review and waiver.
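This leak is easy to audit for: a task pinned to an all-purpose cluster carries an `existing_cluster_id` in its settings. A minimal sketch over Jobs API 2.1-shaped settings dicts (the helper name is ours, not the SDK's):

```python
# Sketch: flag scheduled jobs attached to an all-purpose cluster.
# Works on plain dicts shaped like Jobs API 2.1 job settings.

def jobs_on_all_purpose(jobs: list[dict]) -> list[str]:
    offenders = []
    for job in jobs:
        settings = job.get("settings", {})
        for task in settings.get("tasks", []):
            # existing_cluster_id means the task runs on an all-purpose cluster
            if "existing_cluster_id" in task:
                offenders.append(settings.get("name", str(job.get("job_id"))))
                break
    return offenders

jobs = [
    {"job_id": 1, "settings": {"name": "ok_job",
        "tasks": [{"task_key": "t"}]}},
    {"job_id": 2, "settings": {"name": "leaky_job",
        "tasks": [{"task_key": "t", "existing_cluster_id": "0101-abcd"}]}},
]
print(jobs_on_all_purpose(jobs))  # ['leaky_job']
```

In a real workspace you would feed this from a jobs listing call and route offenders into the exception-review process.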
Photon
Photon is Databricks' vectorized C++ execution engine. It replaces the JVM query engine and delivers roughly 2–3× throughput for Parquet scans and aggregations.
- Enable Photon everywhere by default.
- Disable Photon only for jobs where queries finish in under two seconds. The engine's startup tax is not worth it on trivial work.
- On Serverless SQL warehouses Photon is always on, and you do not have a choice. That is fine.
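For classic clusters, the rule above reduces to choosing a `runtime_engine` value. A trivial sketch, with the two-second threshold mirroring the guidance here (tune it per workload):

```python
# Sketch: pick the clusters API runtime_engine value from a workload's
# typical query runtime. Threshold is the rule of thumb above.

def runtime_engine(median_query_seconds: float) -> str:
    # Photon's startup tax outweighs its gains on trivial queries
    return "STANDARD" if median_query_seconds < 2.0 else "PHOTON"

print(runtime_engine(0.5))   # STANDARD
print(runtime_engine(30.0))  # PHOTON
```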
Picking compute: a decision tree
Walk this in order. Stop at the first match.
- Is the workload a SQL query (BI, dbt, ad-hoc)? → SQL warehouse. Serverless unless unavailable.
- Is the workload a scheduled job (Airflow, Lakeflow Jobs, cron)? → Serverless jobs compute. Job cluster if serverless cannot support it.
- Are you typing into a notebook while looking at the output? → All-purpose cluster. Prefer one provisioned by an instance pool.
- Are you building a streaming pipeline with declarative semantics? → Serverless Lakeflow Declarative Pipelines. See the LDP guide.
- Are you serving low-latency transactional queries to an application? → Not compute. Lakebase. See the Lakebase guide.
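The decision tree above, encoded as a first-match lookup. The workload labels are ours; the order is the point — stop at the first match.

```python
# The compute decision tree as code. Walk in order, stop at first match.

def pick_compute(workload: str) -> str:
    order = [
        ("sql", "SQL warehouse (serverless unless unavailable)"),
        ("scheduled_job", "serverless jobs compute (job cluster as fallback)"),
        ("interactive_notebook", "all-purpose cluster (prefer an instance pool)"),
        ("declarative_streaming", "serverless Lakeflow Declarative Pipelines"),
        ("app_transactional", "Lakebase (not Databricks compute)"),
    ]
    for key, answer in order:
        if workload == key:
            return answer
    raise ValueError(f"unknown workload: {workload}")

print(pick_compute("scheduled_job"))
```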
Cost attribution
Every compute resource should carry custom tags for billing attribution:
```json
{
  "custom_tags": {
    "team": "data-engineering",
    "cost_center": "DE-001",
    "project": "customer-360",
    "environment": "prod"
  }
}
```
Tags propagate to AWS Cost Explorer (or Azure / GCP equivalent) so Finance can cross-reference DBU spend against the team that owns it.
Cluster policies
Cluster policies enforce guardrails: allowed instance types, maximum worker counts, required tags, mandatory autotermination. They are the mechanism that prevents a rogue config from turning into a surprise on the invoice.
```json
{
  "node_type_id": {
    "type": "allowlist",
    "values": ["m5.xlarge", "m5.2xlarge", "r5.xlarge"]
  },
  "autoscale.max_workers": {
    "type": "range",
    "minValue": 1,
    "maxValue": 20,
    "defaultValue": 8
  },
  "autotermination_minutes": {
    "type": "range",
    "minValue": 10,
    "maxValue": 120
  },
  "custom_tags.cost_center": {
    "type": "fixed",
    "value": "data-engineering"
  }
}
```
Every Causeway workspace applies a default policy to its all-purpose clusters. You cannot opt out; you can only request an exception.
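To see why a cluster request bounces off a policy, it helps to mimic the evaluation locally. A sketch covering the three rule types used above (`allowlist`, `range`, `fixed`); real enforcement happens server-side and supports more rule types than this:

```python
# Sketch: evaluate a flattened cluster request against policy rules
# of type allowlist / range / fixed. Server-side enforcement is richer.

def violates(policy: dict, request: dict) -> list[str]:
    problems = []
    for path, rule in policy.items():
        value = request.get(path)
        if rule["type"] == "allowlist" and value not in rule["values"]:
            problems.append(f"{path}: {value!r} not in allowlist")
        elif rule["type"] == "range":
            if value is None or not rule["minValue"] <= value <= rule["maxValue"]:
                problems.append(f"{path}: {value!r} outside range")
        elif rule["type"] == "fixed" and value != rule["value"]:
            problems.append(f"{path}: must be {rule['value']!r}")
    return problems

policy = {
    "node_type_id": {"type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"]},
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 120},
    "custom_tags.cost_center": {"type": "fixed", "value": "data-engineering"},
}
request = {
    "node_type_id": "p4d.24xlarge",       # not on the allowlist
    "autotermination_minutes": 0,          # below the range minimum
    "custom_tags.cost_center": "data-engineering",
}
for problem in violates(policy, request):
    print(problem)
```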
See also
- Cluster troubleshooting — when compute misbehaves.
- SQL warehouse reference — exhaustive config options.
- Production readiness — the compute items on the checklist.