A cluster is not starting, or it started and promptly died, or it is up but queries are pinned at 100% CPU. This guide is the procedure. It assumes you can read a stack trace and have workspace admin or the equivalent.
Step 1: Classify (30 seconds)
Open the Databricks UI → Compute → the affected cluster → Event log. Look at the last state transition.
| Transition | Class | First place to look |
|---|---|---|
| STARTING → RUNNING, but slow queries | Running / performance | Spark UI, Ganglia |
| STARTING → TERMINATING | Startup failure | Event log reason code |
| RUNNING → TERMINATING | Unexpected termination | Event log reason code |
| RUNNING → RESIZING | Autoscale event | Normal; not a problem |
| Never left PENDING | Cluster provisioning | Cloud provider limits |
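If you triage by script rather than by eye (say, polling the cluster events endpoint and bucketing results), the table above reduces to a lookup. A minimal sketch in plain Python; the state strings match the event log, the function name is hypothetical:

```python
def classify_transition(prev_state: str, new_state: str) -> str:
    """Bucket a cluster state transition per the classification table (sketch)."""
    transitions = {
        ("STARTING", "RUNNING"): "normal startup (check performance if slow)",
        ("STARTING", "TERMINATING"): "startup failure: read the reason code",
        ("RUNNING", "TERMINATING"): "unexpected termination: read the reason code",
        ("RUNNING", "RESIZING"): "autoscale event: normal, not a problem",
    }
    return transitions.get((prev_state, new_state), "unknown transition")

print(classify_transition("STARTING", "TERMINATING"))
# startup failure: read the reason code
```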
Step 2: Startup failures
The event log's `TERMINATING` event carries a reason code. The canonical ones:
| Reason code | What it means | Fix |
|---|---|---|
| `CLOUD_PROVIDER_LAUNCH_FAILURE` | AWS/Azure/GCP could not provision the instance type | Check EC2 limits; try a different type; try a different AZ |
| `DRIVER_UNREACHABLE` | Databricks cannot talk to the driver node | Network issue: security groups, VPC peering, routing |
| `INIT_SCRIPT_FAILURE` | An init script returned non-zero | Read the init script logs in the cluster event log; fix the script |
| `CLOUD_PROVIDER_SHUTDOWN` | The cloud reclaimed the instance (spot preemption, maintenance) | Retry; if persistent, use on-demand for the driver |
| `INTERNAL_ERROR` | Databricks internal issue | Retry; if persistent, contact support |
CLOUD_PROVIDER_LAUNCH_FAILURE in detail
Two common roots:
- EC2 limit. AWS caps vCPU count per instance family per region. Request a limit increase via the AWS console, or try a different family.
- AZ exhaustion. Spot capacity for a given instance type in a given AZ dries up. Set `"zone_id": "auto"` so Databricks picks whichever AZ has capacity:
```json
{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "zone_id": "auto",
    "first_on_demand": 1
  }
}
```
INIT_SCRIPT_FAILURE
Init scripts live in a Unity Catalog volume or DBFS. When one fails:
- Find the script at the path referenced in the cluster config.
- Find the log. UI: Compute → cluster → Logs → `init_scripts/<timestamp>-<node>/...`.
- Reproduce locally (init scripts are bash).
Note
Init scripts run in cluster-startup order as root. A broken init script on a shared cluster blocks every user of that cluster. Test init scripts on a single-user cluster first; promote to cluster policies only after.
Step 3: Running but slow
Cluster is up. Jobs do not finish. Three places to look, in order:
Ganglia (cluster-level)
UI: Compute → cluster → Metrics → Ganglia.
| Metric | Warning sign | What it means |
|---|---|---|
| CPU utilization | Constantly > 90% | Cluster saturated; workload is CPU-bound: scale horizontally or vertically |
| Memory | Near the limit | Risk of OOM: use a bigger instance, or fewer parallel tasks per executor |
| Network I/O | Spikes | Large shuffle or S3 transfer: optimize the query; broadcast the small side of joins |
| JVM GC time | > 20% | Memory pressure: increase executor memory |
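For anyone exporting these metrics into their own alerting, the warning thresholds in the table reduce to a few comparisons. A sketch in plain Python; the parameter names are hypothetical, not a Ganglia API:

```python
def warning_signs(cpu_pct: float, mem_used_pct: float,
                  gc_time_ms: int, cpu_time_ms: int) -> list[str]:
    """Apply the warning thresholds from the table above (hypothetical fields)."""
    signs = []
    if cpu_pct > 90:
        signs.append("CPU saturated: scale horizontally or vertically")
    if mem_used_pct > 95:
        signs.append("memory near limit: risk of OOM")
    if cpu_time_ms and gc_time_ms / cpu_time_ms > 0.20:
        signs.append("GC time > 20%: increase executor memory")
    return signs

# Executor at 97% CPU that spent 15 s of a 60 s window in GC trips two alarms.
print(warning_signs(cpu_pct=97, mem_used_pct=60,
                    gc_time_ms=15_000, cpu_time_ms=60_000))
```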
Spark UI (query-level)
UI: Compute → cluster → Spark UI → Jobs.
Find the slow job. Click through to the stage that is the bottleneck.
Look for:
- Skewed partitions: one task takes 10x the time of the median. Salt the join key or enable AQE skew handling.
- Spill to disk: execution memory exceeded; tasks spill intermediate state. Bigger executor memory, or change the operation.
- Long shuffle read: a task spends most of its time fetching from other executors. Reduce `spark.sql.shuffle.partitions` if you have too many tiny partitions; increase it if you have too few giant ones.
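The salting fix for skew can be illustrated without Spark: append a random salt to the hot key so its rows spread across many tasks (the other side of the join is then replicated once per salt value). A toy sketch in plain Python:

```python
import random
from collections import Counter

random.seed(0)  # deterministic for the illustration

def salted_key(key: str, num_salts: int) -> str:
    # Append a random salt so one hot key maps to num_salts buckets.
    return f"{key}#{random.randrange(num_salts)}"

# Simulate a skewed join key: 9,000 of 10,000 rows share one key.
rows = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

unsalted = Counter(rows)
salted = Counter(salted_key(k, 8) for k in rows)

# The hot key's 9,000 rows now land in 8 buckets of roughly 1,125 each,
# instead of a single straggler task processing all 9,000.
assert unsalted["hot"] == 9000
assert max(salted.values()) < 2000
```

In practice, Databricks' AQE skew handling (`spark.sql.adaptive.skewJoin.enabled`) does the splitting for you; manual salting is the fallback when it cannot.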
Query profile (for SQL warehouses)
UI: SQL → Queries → the offending query → Query Profile.
| Section | What to check |
|---|---|
| Planning | Long planning time → complex views, too many tables, stale statistics |
| Execution | Which operator is slowest (scan, join, sort, aggregate) |
| I/O | Rows scanned vs. rows returned (low ratio = missing partition pruning / Z-order) |
| Spill | Disk spill → insufficient memory |
| Photon | Whether Photon is enabled (always on for Serverless SQL) |
Common fixes:
- Missing partition filter. Add one: `WHERE dt >= '2026-03-01'`.
- Missing Z-order: `OPTIMIZE prod.gold.events ZORDER BY (customer_id, event_type)`.
- Stale statistics: `ANALYZE TABLE prod.gold.events COMPUTE STATISTICS FOR ALL COLUMNS`.
- Full-table scans on Delta tables. Confirm the table has Delta statistics and that the planner is using them.
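The I/O check (rows scanned vs. rows returned) is simple arithmetic worth automating if you review many profiles. A sketch with hypothetical numbers:

```python
def scan_selectivity(rows_scanned: int, rows_returned: int) -> float:
    """Fraction of scanned rows the query actually returned."""
    return rows_returned / rows_scanned if rows_scanned else 1.0

# Hypothetical profile: 50M rows scanned to return 12k.
ratio = scan_selectivity(rows_scanned=50_000_000, rows_returned=12_000)
assert ratio < 0.01  # under 1% selectivity: suspect missing pruning or Z-order
```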
Step 4: Spot preemption
Symptom: a task suddenly fails with "executor lost". Cause: a spot instance got reclaimed by the cloud provider.
Mitigations:
```json
{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "first_on_demand": 1
  },
  "spark_conf": {
    "spark.task.maxFailures": "4"
  }
}
```
- `first_on_demand: 1` keeps the driver on an on-demand instance; losing the driver is fatal to the whole job.
- `SPOT_WITH_FALLBACK` falls back to on-demand if spot is unavailable.
- `spark.task.maxFailures: 4` retries tasks when their executor dies. The default is already 4; verify nothing has overridden it lower.
Warning
Never run a critical production job with first_on_demand: 0. The driver being on spot means any preemption kills the whole job; you get partial results and no retry semantics. On-demand for the driver is a tiny cost premium for a large reliability gain.
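This invariant is easy to enforce mechanically, e.g. in CI over your cluster specs. A hypothetical lint helper, sketched in Python against the JSON shape shown above:

```python
def lint_spot_config(cluster_spec: dict) -> list[str]:
    """Flag risky spot settings in a cluster spec (hypothetical helper)."""
    warnings = []
    aws = cluster_spec.get("aws_attributes", {})
    if aws.get("first_on_demand", 0) < 1:
        warnings.append("driver may land on spot: set first_on_demand >= 1")
    if aws.get("availability") == "SPOT":
        warnings.append("no on-demand fallback: prefer SPOT_WITH_FALLBACK")
    return warnings

risky = {"aws_attributes": {"availability": "SPOT", "first_on_demand": 0}}
print(lint_spot_config(risky))
```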
Step 5: Out-of-memory errors
Symptom: `java.lang.OutOfMemoryError: Java heap space` in executor logs.
Three fixes, in order of effort:
- Enable adaptive query execution if it is not already on: `spark.sql.adaptive.enabled=true` and `spark.databricks.adaptive.autoOptimizeShuffle.enabled=true`.
- Use a bigger instance type (horizontal scaling rarely fixes OOM; vertical usually does).
- Change the operation: broadcast smaller tables, bucket larger ones, reduce partition size.
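Whether Spark broadcasts the small side automatically is governed by `spark.sql.autoBroadcastJoinThreshold` (default 10 MB). A back-of-envelope check, sketched in plain Python:

```python
def fits_auto_broadcast(table_bytes: int,
                        threshold_bytes: int = 10 * 1024 * 1024) -> bool:
    # Default mirrors spark.sql.autoBroadcastJoinThreshold (10 MB).
    return table_bytes <= threshold_bytes

assert fits_auto_broadcast(8 * 1024 * 1024)      # 8 MB dimension table: broadcast
assert not fits_auto_broadcast(2 * 1024 ** 3)    # 2 GB fact table: shuffle join
```

In PySpark, a table Spark declines to broadcast can still be forced with the `broadcast()` hint, provided you know it fits in executor memory.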
Specifically for Delta writes:
```
spark.databricks.delta.optimizeWrite.enabled=true
spark.databricks.delta.autoCompact.enabled=true
```
These combine small writes into larger files, which reduces downstream read cost and often reduces OOM risk for writes too.
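The benefit is mostly file-count arithmetic. A toy sketch, assuming a 128 MB target file size (the usual auto-compaction target; actual packing is not this ideal):

```python
import math

def files_after_compaction(file_sizes_mb: list[float], target_mb: int = 128) -> int:
    # Ideal bin-packing: total bytes divided into target-sized files.
    return max(1, math.ceil(sum(file_sizes_mb) / target_mb))

# 1,000 one-megabyte part files compact into ~8 files of ~128 MB,
# so a downstream scan opens 8 files instead of 1,000.
assert files_after_compaction([1.0] * 1000) == 8
```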
Step 6: When in doubt, restart
Long-running all-purpose clusters accumulate state: broken temp views, stale broadcasts, half-done caches, classloader leaks. If diagnosis is taking more than 20 minutes and the workload is not time-critical, restart the cluster:
```
databricks clusters restart <cluster-id>
```
Note
Schedule a weekly restart for any all-purpose cluster that lives longer than a week. Accumulated state kills performance in ways that do not show up in a single metric; the restart is cheap compared to the debugging bills.
Operational reference
Event log events
- `STARTING` → `RUNNING`: normal startup
- `STARTING` → `TERMINATING`: startup failure (reason code above)
- `RUNNING` → `RESIZING`: autoscale up or down
- `RUNNING` → `TERMINATING`: unexpected termination
- `RUNNING` → `RESTARTING`: manual or scheduled restart
Driver log locations
UI: Compute → cluster → Driver Logs.
- stdout: print statements from your code.
- stderr: Spark warnings, errors, stack traces.
- log4j: detailed Spark logging (INFO-level spam; useful for edge cases).
When to escalate
Engage infrastructure or Databricks support when:
- `INTERNAL_ERROR` persists across retries.
- Workspace-wide launch failures with no apparent cloud-provider cause.
- A cluster fails to terminate (stuck in `TERMINATING`).
- Billing anomalies not explained by the cluster's config.
See also
- Compute types — right-sizing and cluster policies.
- Common errors — symptom-first lookup.
- Production readiness — what to wire up so the next failure is easier to triage.