A cluster is not starting, or it started and promptly died, or it is up but queries are pinned at 100% CPU. This guide is the procedure. It assumes you can read a stack trace and have workspace admin or the equivalent.
Step 1: Classify (30 seconds)
Open the Databricks UI → Compute → the affected cluster → Event log. Look at the last state transition.
| Transition | Class | First place to look |
|---|---|---|
| STARTING → RUNNING, but slow queries | Running / performance | Spark UI, Ganglia |
| STARTING → TERMINATING | Startup failure | Event log reason code |
| RUNNING → TERMINATING | Unexpected termination | Event log reason code |
| RUNNING → RESIZING | Autoscale event | Normal; not a problem |
| Never left PENDING | Cluster provisioning | Cloud provider limits |
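If you triage by script rather than by eye (say, polling the cluster events endpoint and bucketing results), the table above reduces to a lookup. A minimal sketch in plain Python; the state strings match the event log, the function name is hypothetical:

```python
def classify_transition(prev_state: str, new_state: str) -> str:
    """Bucket a cluster state transition per the classification table (sketch)."""
    transitions = {
        ("STARTING", "RUNNING"): "normal startup (check performance if slow)",
        ("STARTING", "TERMINATING"): "startup failure: read the reason code",
        ("RUNNING", "TERMINATING"): "unexpected termination: read the reason code",
        ("RUNNING", "RESIZING"): "autoscale event: normal, not a problem",
    }
    return transitions.get((prev_state, new_state), "unknown transition")

print(classify_transition("STARTING", "TERMINATING"))
# startup failure: read the reason code
```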
Step 2: Startup failures
The event log's `TERMINATING` event carries a reason code. The canonical ones:
| Reason code | What it means | Fix |
|---|---|---|
| `CLOUD_PROVIDER_LAUNCH_FAILURE` | AWS/Azure/GCP could not provision the instance type | Check EC2 limits; try a different type; try a different AZ |
| `DRIVER_UNREACHABLE` | Databricks cannot talk to the driver node | Network issue: security groups, VPC peering, routing |
| `INIT_SCRIPT_FAILURE` | An init script returned non-zero | Read the init script logs in the cluster event log; fix the script |
| `CLOUD_PROVIDER_SHUTDOWN` | The cloud reclaimed the instance (spot preemption, maintenance) | Retry; if persistent, use on-demand for the driver |
| `INTERNAL_ERROR` | Databricks internal issue | Retry; if persistent, contact support |
CLOUD_PROVIDER_LAUNCH_FAILURE in detail
Two common roots:
- EC2 limit. AWS caps vCPU count per instance family per region. Request a limit increase via the AWS console, or try a different family.
- AZ exhaustion. Spot capacity for a given instance type in a given AZ dries up. Set `"zone_id": "auto"` so Databricks picks whichever AZ has capacity:
```json
{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "zone_id": "auto",
    "first_on_demand": 1
  }
}
```
INIT_SCRIPT_FAILURE
Init scripts live in a Unity Catalog volume or DBFS. When one fails:
- Find the script at the path referenced in the cluster config.
- Find the log. UI: Compute → cluster → Logs → `init_scripts/<timestamp>-<node>/...`.
- Reproduce locally (init scripts are bash).
Note
Init scripts run in cluster-startup order as root. A broken init script on a shared cluster blocks every user of that cluster. Test init scripts on a single-user cluster first; promote to cluster policies only after.
Step 3: Running but slow
Cluster is up. Jobs do not finish. Three places to look, in order:
Ganglia (cluster-level)
UI: Compute → cluster → Metrics → Ganglia.
| Metric | Warning sign | What it means |
|---|---|---|
| CPU utilization | Constantly > 90% | Cluster saturated; workload is CPU-bound: scale horizontally or vertically |
| Memory | Near the limit | Risk of OOM: use a bigger instance, or fewer parallel tasks per executor |
| Network I/O | Spikes | Large shuffle or S3 transfer: optimize the query; broadcast the small side of joins |
| JVM GC time | > 20% | Memory pressure: increase executor memory |
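For anyone exporting these metrics into their own alerting, the warning thresholds in the table reduce to a few comparisons. A sketch in plain Python; the parameter names are hypothetical, not a Ganglia API:

```python
def warning_signs(cpu_pct: float, mem_used_pct: float,
                  gc_time_ms: int, cpu_time_ms: int) -> list[str]:
    """Apply the warning thresholds from the table above (hypothetical fields)."""
    signs = []
    if cpu_pct > 90:
        signs.append("CPU saturated: scale horizontally or vertically")
    if mem_used_pct > 95:
        signs.append("memory near limit: risk of OOM")
    if cpu_time_ms and gc_time_ms / cpu_time_ms > 0.20:
        signs.append("GC time > 20%: increase executor memory")
    return signs

# Executor at 97% CPU that spent 15 s of a 60 s window in GC trips two alarms.
print(warning_signs(cpu_pct=97, mem_used_pct=60,
                    gc_time_ms=15_000, cpu_time_ms=60_000))
```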
Spark UI (query-level)
UI: Compute → cluster → Spark UI → Jobs.
Find the slow job. Click through to the stage that is the bottleneck.
Look for:
- Skewed partitions: one task takes 10x the time of the median. Salt the join key or enable AQE skew handling.
- Spill to disk: execution memory exceeded; tasks spill intermediate state. Bigger executor memory, or change the operation.
- Long shuffle read: a task spends most of its time fetching from other executors. Reduce `spark.sql.shuffle.partitions` if you have too many tiny partitions; increase it if you have too few giant ones.
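The salting fix for skew can be illustrated without Spark: append a random salt to the hot key so its rows spread across many tasks (the other side of the join is then replicated once per salt value). A toy sketch in plain Python:

```python
import random
from collections import Counter

random.seed(0)  # deterministic for the illustration

def salted_key(key: str, num_salts: int) -> str:
    # Append a random salt so one hot key maps to num_salts buckets.
    return f"{key}#{random.randrange(num_salts)}"

# Simulate a skewed join key: 9,000 of 10,000 rows share one key.
rows = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

unsalted = Counter(rows)
salted = Counter(salted_key(k, 8) for k in rows)

# The hot key's 9,000 rows now land in 8 buckets of roughly 1,125 each,
# instead of a single straggler task processing all 9,000.
assert unsalted["hot"] == 9000
assert max(salted.values()) < 2000
```

In practice, Databricks' AQE skew handling (`spark.sql.adaptive.skewJoin.enabled`) does the splitting for you; manual salting is the fallback when it cannot.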
Query profile (for SQL warehouses)
UI: SQL → Queries → the offending query → Query Profile.
| Section | What to check |
|---|---|
| Planning | Long planning time → complex views, too many tables, stale statistics |
| Execution | Which operator is slowest (scan, join, sort, aggregate) |
| I/O | Rows scanned vs. rows returned (low ratio = missing partition pruning / Z-order) |
| Spill | Disk spill → insufficient memory |
| Photon | Whether Photon is enabled (always on for Serverless SQL) |
Common fixes:
- Missing partition filter. Add one: `WHERE dt >= '2026-03-01'`.
- Missing Z-order: `OPTIMIZE prod.gold.events ZORDER BY (customer_id, event_type)`.
- Stale statistics: `ANALYZE TABLE prod.gold.events COMPUTE STATISTICS FOR ALL COLUMNS`.
- Full-table scans on Delta tables. Confirm the table has Delta statistics and that the planner is using them.
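The I/O check (rows scanned vs. rows returned) is simple arithmetic worth automating if you review many profiles. A sketch with hypothetical numbers:

```python
def scan_selectivity(rows_scanned: int, rows_returned: int) -> float:
    """Fraction of scanned rows the query actually returned."""
    return rows_returned / rows_scanned if rows_scanned else 1.0

# Hypothetical profile: 50M rows scanned to return 12k.
ratio = scan_selectivity(rows_scanned=50_000_000, rows_returned=12_000)
assert ratio < 0.01  # under 1% selectivity: suspect missing pruning or Z-order
```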
Step 4: Spot preemption
Symptom: a task suddenly fails with "executor lost". Cause: a spot instance got reclaimed by the cloud provider.
Mitigations:
```json
{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "first_on_demand": 1
  },
  "spark_conf": {
    "spark.task.maxFailures": "4"
  }
}
```
- `first_on_demand: 1` keeps the driver on an on-demand instance; losing the driver is fatal to the whole job.
- `SPOT_WITH_FALLBACK` falls back to on-demand if spot is unavailable.
- `spark.task.maxFailures: 4` retries tasks when their executor dies. The default is already 4; verify nothing has overridden it lower.
Warning
Never run a critical production job with first_on_demand: 0. The driver being on spot means any preemption kills the whole job; you get partial results and no retry semantics. On-demand for the driver is a tiny cost premium for a large reliability gain.
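This invariant is easy to enforce mechanically, e.g. in CI over your cluster specs. A hypothetical lint helper, sketched in Python against the JSON shape shown above:

```python
def lint_spot_config(cluster_spec: dict) -> list[str]:
    """Flag risky spot settings in a cluster spec (hypothetical helper)."""
    warnings = []
    aws = cluster_spec.get("aws_attributes", {})
    if aws.get("first_on_demand", 0) < 1:
        warnings.append("driver may land on spot: set first_on_demand >= 1")
    if aws.get("availability") == "SPOT":
        warnings.append("no on-demand fallback: prefer SPOT_WITH_FALLBACK")
    return warnings

risky = {"aws_attributes": {"availability": "SPOT", "first_on_demand": 0}}
print(lint_spot_config(risky))
```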
Step 5: Out-of-memory errors
Symptom: `java.lang.OutOfMemoryError: Java heap space` in executor logs.
Three fixes, in order of effort:
- Enable adaptive query execution if it is not already on: `spark.sql.adaptive.enabled=true` and `spark.databricks.adaptive.autoOptimizeShuffle.enabled=true`.
- Use a bigger instance type (horizontal scaling rarely fixes OOM; vertical usually does).
- Change the operation: broadcast smaller tables, bucket larger ones, reduce partition size.
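Whether Spark broadcasts the small side automatically is governed by `spark.sql.autoBroadcastJoinThreshold` (default 10 MB). A back-of-envelope check, sketched in plain Python:

```python
def fits_auto_broadcast(table_bytes: int,
                        threshold_bytes: int = 10 * 1024 * 1024) -> bool:
    # Default mirrors spark.sql.autoBroadcastJoinThreshold (10 MB).
    return table_bytes <= threshold_bytes

assert fits_auto_broadcast(8 * 1024 * 1024)      # 8 MB dimension table: broadcast
assert not fits_auto_broadcast(2 * 1024 ** 3)    # 2 GB fact table: shuffle join
```

In PySpark, a table Spark declines to broadcast can still be forced with the `broadcast()` hint, provided you know it fits in executor memory.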
Specifically for Delta writes:
```
spark.databricks.delta.optimizeWrite.enabled=true
spark.databricks.delta.autoCompact.enabled=true
```
These combine small writes into larger files, which reduces downstream read cost and often reduces OOM risk for writes too.
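The benefit is mostly file-count arithmetic. A toy sketch, assuming a 128 MB target file size (the usual auto-compaction target; actual packing is not this ideal):

```python
import math

def files_after_compaction(file_sizes_mb: list[float], target_mb: int = 128) -> int:
    # Ideal bin-packing: total bytes divided into target-sized files.
    return max(1, math.ceil(sum(file_sizes_mb) / target_mb))

# 1,000 one-megabyte part files compact into ~8 files of ~128 MB,
# so a downstream scan opens 8 files instead of 1,000.
assert files_after_compaction([1.0] * 1000) == 8
```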
Step 6: When in doubt, restart
Long-running all-purpose clusters accumulate state: broken temp views, stale broadcasts, half-done caches, classloader leaks. If diagnosis is taking more than 20 minutes and the workload is not time-critical, restart the cluster:
```
databricks clusters restart <cluster-id>
```
Note
Schedule a weekly restart for any all-purpose cluster that lives longer than a week. Accumulated state kills performance in ways that do not show up in a single metric; the restart is cheap compared to the debugging bills.
Operational reference
Event log events
- `STARTING` → `RUNNING`: normal startup
- `STARTING` → `TERMINATING`: startup failure (reason code above)
- `RUNNING` → `RESIZING`: autoscale up or down
- `RUNNING` → `TERMINATING`: unexpected termination
- `RUNNING` → `RESTARTING`: manual or scheduled restart
Driver log locations
UI: Compute → cluster → Driver Logs.
- stdout: print statements from your code.
- stderr: Spark warnings, errors, stack traces.
- log4j: detailed Spark logging (INFO-level spam; useful for edge cases).
When to escalate
Engage infrastructure or Databricks support when:
- `INTERNAL_ERROR` persists across retries.
- Workspace-wide launch failures with no apparent cloud-provider cause.
- A cluster fails to terminate (stuck in `TERMINATING`).
- Billing anomalies not explained by the cluster's config.
See also
- Compute types — right-sizing and cluster policies.
- Common errors — symptom-first lookup.
- Production readiness — what to wire up so the next failure is easier to triage.