A cluster is not starting, or it started and promptly died, or it is up but queries are pinned at 100% CPU. This guide is the procedure. It assumes you can read a stack trace and have workspace admin or the equivalent.

Step 1: Classify (30 seconds)

Open the Databricks UI → Compute → the affected cluster → Event log. Look at the last state transition.

| Transition | Class | First place to look |
| --- | --- | --- |
| STARTING → RUNNING, but slow queries | Running / performance | Spark UI, Ganglia |
| STARTING → TERMINATING | Startup failure | Event log reason code |
| RUNNING → TERMINATING | Unexpected termination | Event log reason code |
| RUNNING → RESIZING | Autoscale event | Normal; not a problem |
| Never left PENDING | Cluster provisioning | Cloud provider limits |
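The classification above is mechanical enough to script into an alerting hook. A minimal sketch; the transition pairs and class labels mirror the table, but the function itself is illustrative, not a Databricks API:

```python
# Map an observed cluster state transition to a triage class.
# The class labels are this guide's terms, not official Databricks event names.
TRIAGE = {
    ("STARTING", "RUNNING"): "running-performance",      # up but slow: Spark UI, Ganglia
    ("STARTING", "TERMINATING"): "startup-failure",      # read the event log reason code
    ("RUNNING", "TERMINATING"): "unexpected-termination",
    ("RUNNING", "RESIZING"): "autoscale",                # normal; not a problem
}

def classify(prev_state: str, new_state: str) -> str:
    if prev_state == "PENDING" and new_state == "PENDING":
        return "provisioning"  # never left PENDING: check cloud provider limits
    return TRIAGE.get((prev_state, new_state), "unknown")
```

Feeding it the last two states from the event log gives the starting point for the rest of this guide.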

Step 2: Startup failures

The event log's TERMINATING event carries a reason code. The canonical ones:

| Reason code | What it means | Fix |
| --- | --- | --- |
| CLOUD_PROVIDER_LAUNCH_FAILURE | AWS/Azure/GCP could not provision the instance type | Check EC2 limits; try a different type; try a different AZ |
| DRIVER_UNREACHABLE | Databricks cannot talk to the driver node | Network issue: security groups, VPC peering, routing |
| INIT_SCRIPT_FAILURE | An init script returned non-zero | Read the init script logs in the cluster event log; fix the script |
| CLOUD_PROVIDER_SHUTDOWN | The cloud reclaimed the instance (spot preemption, maintenance) | Retry; if persistent, use on-demand for the driver |
| INTERNAL_ERROR | Databricks internal issue | Retry; if persistent, contact support |

CLOUD_PROVIDER_LAUNCH_FAILURE in detail

Two common roots:

  1. EC2 limit. AWS limits vCPU count per instance family per region. Request a raise via the console or try a different family.
  2. AZ exhaustion. Spot capacity for a given instance type in a given AZ dries up. Switch to zone_id: auto so Databricks picks whichever AZ has capacity.
{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "zone_id": "auto",
    "first_on_demand": 1
  }
}

INIT_SCRIPT_FAILURE

Init scripts live in a Unity Catalog volume or DBFS. When one fails:

  1. Find the script at the path referenced in the cluster config.
  2. Find the log. UI: Compute → cluster → Logs → init_scripts/<timestamp>-<node>/....
  3. Reproduce locally (init scripts are bash).

Note

Init scripts run in cluster-startup order as root. A broken init script on a shared cluster blocks every user of that cluster. Test init scripts on a single-user cluster first; promote to cluster policies only after.

Step 3: Running but slow

Cluster is up. Jobs do not finish. Three places to look, in order:

Ganglia (cluster-level)

UI: Compute → cluster → Metrics → Ganglia.

| Metric | Warning sign | What it means | Fix |
| --- | --- | --- | --- |
| CPU utilization | Constantly > 90% | Cluster saturated | Workload is CPU-bound; scale horizontally or vertically |
| Memory | Near limit | Risk of OOM | Bigger instance, or fewer parallel tasks per executor |
| Network I/O | Spikes | Large shuffle or S3 transfer | Optimize the query; broadcast the small side of joins |
| JVM GC time | > 20% | Memory pressure | Increase executor memory |
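These warning signs reduce to a few threshold checks, which is handy if you export cluster metrics somewhere scriptable. A sketch: the CPU and GC thresholds come from the table; the 90% memory cutoff is an assumption, since "near limit" is not quantified above.

```python
def ganglia_warnings(cpu_util: float, mem_used_frac: float, gc_time_frac: float) -> list[str]:
    """Return warning labels for cluster-level metrics. Inputs are fractions in [0, 1]."""
    warnings = []
    if cpu_util > 0.90:
        warnings.append("cluster saturated: CPU-bound, scale out or up")
    if mem_used_frac > 0.90:  # "near limit" is unquantified; 90% is an assumption
        warnings.append("risk of OOM: bigger instance or fewer tasks per executor")
    if gc_time_frac > 0.20:
        warnings.append("memory pressure: increase executor memory")
    return warnings
```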

Spark UI (query-level)

UI: Compute → cluster → Spark UI → Jobs.

Find the slow job. Click through to the stage that is the bottleneck.

Look for:

  1. Task skew: the slowest task in a stage taking far longer than the median.
  2. Shuffle read/write sizes far larger than the stage's input.
  3. Spill (memory and disk) in the stage metrics.
  4. Partition counts that are far too low (a few huge tasks) or far too high (scheduling overhead dominates).

Query profile (for SQL warehouses)

UI: SQL → Queries → the offending query → Query Profile.

| Section | What to check |
| --- | --- |
| Planning | Long planning time → complex views, too many tables, stale statistics |
| Execution | Which operator is slowest (scan, join, sort, aggregate) |
| I/O | Rows scanned vs. rows returned (low ratio = missing partition pruning / Z-order) |
| Spill | Disk spill → insufficient memory |
| Photon | Is Photon enabled (always yes on Serverless SQL) |

Common fixes:

  1. Filter on the partition column so partition pruning can kick in.
  2. Broadcast the small side of a join.
  3. Refresh statistics (ANALYZE TABLE ... COMPUTE STATISTICS) when planning time dominates.
  4. OPTIMIZE with ZORDER BY on the columns you filter on most.
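From the I/O row of the profile, a low returned-to-scanned ratio is the cheapest signal to automate across many queries. A sketch; the 1% cutoff is an assumption for illustration, not a Databricks threshold:

```python
def pruning_suspect(rows_scanned: int, rows_returned: int, cutoff: float = 0.01) -> bool:
    """Flag a query whose scan returns a tiny fraction of what it reads,
    a classic sign of missing partition pruning or Z-ordering."""
    if rows_scanned == 0:
        return False
    return rows_returned / rows_scanned < cutoff
```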

Step 4: Spot preemption

Symptom: a task suddenly fails with "executor lost". Cause: a spot instance got reclaimed by the cloud provider.

Mitigations:

{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "first_on_demand": 1
  },
  "spark_conf": {
    "spark.task.maxFailures": "4"
  }
}
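A quick lint over aws_attributes catches the risky spot layouts before a job runs. A sketch: the key names and values match the Clusters API config above, but the rules are this guide's judgment calls, not Databricks validation.

```python
def spot_config_risks(aws_attributes: dict) -> list[str]:
    """Flag spot configurations that trade too much reliability for cost."""
    risks = []
    if aws_attributes.get("first_on_demand", 0) == 0:
        risks.append("driver on spot: one preemption kills the whole job")
    if aws_attributes.get("availability") == "SPOT":  # no fallback variant
        risks.append("no on-demand fallback: launch fails when spot capacity dries up")
    return risks
```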

Warning

Never run a critical production job with first_on_demand: 0. The driver being on spot means any preemption kills the whole job; you get partial results and no retry semantics. On-demand for the driver is a tiny cost premium for a large reliability gain.

Step 5: Out-of-memory errors

Symptom: java.lang.OutOfMemoryError: Java heap space in executor logs.

Three fixes, in order of effort:

  1. Enable adaptive query execution if not already:
    spark.sql.adaptive.enabled=true
    spark.databricks.adaptive.autoOptimizeShuffle.enabled=true
    
  2. Bigger instance type (horizontal scaling rarely fixes OOM; vertical usually does).
  3. Change the operation: broadcast smaller tables, bucket larger ones, reduce partition size.
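The "fewer parallel tasks per executor" lever in the fixes above is simple arithmetic: each concurrently running task gets roughly (executor heap × spark.memory.fraction) / cores. A sketch; 0.6 is the documented default for spark.memory.fraction, and the model deliberately ignores overhead and off-heap memory:

```python
def per_task_memory_mb(executor_memory_mb: int, cores: int, memory_fraction: float = 0.6) -> float:
    """Rough per-task memory budget: unified memory divided by concurrent task slots.
    spark.memory.fraction defaults to 0.6; overhead and off-heap are ignored."""
    return executor_memory_mb * memory_fraction / cores
```

Halving the cores per executor doubles each task's budget, which is why reducing parallelism can fix an OOM without changing instance type.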

Specifically for Delta writes:

spark.databricks.delta.optimizeWrite.enabled=true
spark.databricks.delta.autoCompact.enabled=true

These combine small writes into larger files, which reduces downstream read cost and often reduces OOM risk for writes too.

Step 6: When in doubt, restart

Long-running all-purpose clusters accumulate state: broken temp views, stale broadcasts, half-done caches, classloader leaks. If diagnosis is taking more than 20 minutes and the workload is not time-critical, restart the cluster:

databricks clusters restart <cluster-id>

Note

Schedule a weekly restart for any all-purpose cluster that lives longer than a week. Accumulated state kills performance in ways that do not show up in a single metric; the restart is cheap compared to the debugging bills.
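The weekly-restart rule is easy to enforce from a scheduled job that checks each cluster's start time. A pure sketch of the decision; the seven-day threshold comes from the note above, and wiring it to the Clusters API and the restart call is left out:

```python
from datetime import datetime, timedelta

def restart_due(started_at: datetime, now: datetime,
                max_age: timedelta = timedelta(days=7)) -> bool:
    """An all-purpose cluster older than max_age is due for a restart."""
    return now - started_at >= max_age
```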

Operational reference

Event log events

Driver log locations

UI: Compute → cluster → Driver Logs (stdout, stderr, and log4j output).

When to escalate

Engage infrastructure or Databricks support when:

  1. INTERNAL_ERROR persists across retries.
  2. CLOUD_PROVIDER_LAUNCH_FAILURE requires a quota raise you cannot request yourself.
  3. DRIVER_UNREACHABLE points at network changes (security groups, VPC peering, routing) owned by another team.

See also