Scan the first column for your symptom. The fix column is the first thing to try, not the only thing.

Cluster startup errors

CLOUD_PROVIDER_LAUNCH_FAILURE

The cloud could not provision the requested instance type.

Fix:

  1. Check EC2 / Azure / GCP vCPU limits in your region.
  2. Try a different instance family (m5.2xlarge → r5.2xlarge).
  3. Set zone_id: auto so Databricks picks an AZ with capacity.
  4. Consider SPOT_WITH_FALLBACK so capacity shortages fall back to on-demand.
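Steps 3 and 4 map onto the cluster spec; a sketch of the relevant fields on AWS (Azure and GCP use different attribute blocks):

```yaml
# Sketch of the relevant cluster settings (AWS attribute names per the Clusters API)
aws_attributes:
  availability: SPOT_WITH_FALLBACK   # fall back to on-demand when spot capacity is short
  first_on_demand: 1                 # keep the driver on on-demand capacity
  zone_id: auto                      # let Databricks pick an AZ with capacity
```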

DRIVER_UNREACHABLE

Databricks cannot reach the driver node over the network.

Fix:

  1. Check security groups / NSGs allow the expected Databricks CIDR ranges.
  2. Verify that VPC peering and routing tables are intact.
  3. Re-create the cluster; occasional transient DNS issues resolve on retry.

INIT_SCRIPT_FAILURE

An init script returned non-zero.

Fix:

  1. UI: Compute → cluster → Logs → init_scripts/<timestamp>-<node>/.
  2. Read the captured stdout/stderr.
  3. Reproduce locally; init scripts are plain bash.
  4. Common causes: missing pip package version, apt lock held by another script, network egress blocked.
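Because init scripts are plain bash, a defensive skeleton covers the apt-lock and flaky-network cases; this retry helper is a generic sketch, not Databricks-specific:

```shell
#!/bin/bash
set -euo pipefail

# retry N CMD...: run CMD up to N times, sleeping between attempts.
# Helps with apt lock contention and transient network egress failures.
retry() {
  local max=$1; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$max" ]; then return 1; fi
    n=$((n+1))
    sleep 1
  done
}

# Hypothetical usage with a pinned package version:
# retry 5 apt-get update -y
# retry 3 pip install 'somepkg==1.2.3'
```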

CLOUD_PROVIDER_SHUTDOWN

The cloud reclaimed the instance (spot preemption or maintenance).

Fix:

  1. Retry.
  2. If persistent during a critical window, switch driver to on-demand (first_on_demand: 1).

INTERNAL_ERROR

Databricks platform issue.

Fix:

  1. Retry.
  2. Check the Databricks status page for regional incidents.
  3. Contact support with the cluster ID and time if persistent.

Unity Catalog permission errors

User does not have USE CATALOG on catalog 'prod'

Missing catalog-level grant.

Fix:

GRANT USE CATALOG ON CATALOG prod TO `<principal>`;

User does not have USE SCHEMA on schema

Missing schema-level grant. Required even if the user has a table-level grant.

Fix:

GRANT USE SCHEMA ON SCHEMA prod.silver TO `<principal>`;

User does not have SELECT on table

Missing table-level read grant.

Fix:

GRANT SELECT ON TABLE prod.silver.customers TO `<principal>`;

User does not have CREATE TABLE

Missing schema-level create grant.

Fix:

GRANT CREATE TABLE ON SCHEMA prod.silver TO `<principal>`;

Only the owner can grant permissions

Non-owner principal trying to GRANT.

Fix: Transfer ownership, or have an owner execute the grant:

ALTER TABLE prod.gold.revenue_summary OWNER TO `data-engineering`;

Delta / Lakehouse errors

DELTA_MISSING_COLUMN

A query references a column that no longer exists in the Delta table.

Fix:

  1. Check the compiled SQL (or the ref() chain if dbt).
  2. If upstream was rewritten, update consumers to the new column set.
  3. Run --full-refresh for incremental models whose schema changed.

MERGE_CARDINALITY_VIOLATION

MERGE INTO found multiple source rows matching a single target row.

Fix: Deduplicate on the merge key before the merge. In dbt incremental:

with ranked as (
    select *, row_number() over (
        partition by order_id order by _loaded_at desc
    ) as rn
    from {{ ref('stg_orders') }}
)
select * except(rn) from ranked where rn = 1

SCHEMA_CHANGE_NOT_ALLOWED

The write's schema differs from the target table's.

Fix:

  1. For incremental dbt: on_schema_change: 'sync_all_columns' or --full-refresh.
  2. For raw Spark: enable spark.databricks.delta.schema.autoMerge.enabled.
  3. For LDP: let type widening handle it, or full-refresh the affected table.
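For the dbt case, the setting lives in the model config; a sketch for a hypothetical incremental model:

```sql
-- Hypothetical model; on_schema_change controls how new/removed columns propagate
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='sync_all_columns'
) }}

select * from {{ ref('stg_orders') }}
```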

STATEMENT_TIMEOUT

SQL query exceeded the warehouse timeout.

Fix:

  1. Read the query profile; find the slow operator.
  2. Add partition filters or Z-order keys.
  3. Break the query into intermediate steps.
  4. Last resort: bump warehouse size or client timeout.

Warning

Bumping warehouse size is an anti-fix. It hides slow SQL that will cost more next month as data grows. Always look at the query plan before changing the warehouse.

WAREHOUSE_NOT_RUNNING

The SQL warehouse is stopped.

Fix:

  1. Start it: UI or databricks sql warehouses start <id>.
  2. Check auto_stop_mins; it may be more aggressive than the workload's cadence.

Lakeflow Declarative Pipeline errors

Schema evolution conflict mid-run

LDP detects a column type or removal the current pipeline cannot evolve.

Fix:

  1. For additive: enable spark.databricks.delta.schema.autoMerge.enabled in pipeline config.
  2. For type changes or removals: full-refresh the affected table.
  3. For breaking changes: version the pipeline; consumers migrate deliberately.
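Step 1's Spark conf goes in the pipeline's configuration block; a sketch assuming a pipeline settings file:

```yaml
# Pipeline-level Spark conf enabling additive schema merge
configuration:
  spark.databricks.delta.schema.autoMerge.enabled: "true"
```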

Checkpoint corruption

StreamingQueryException: Error reading checkpoint
InvalidOffsetException: ...

Fix:

  1. Full-refresh the affected streaming table; this resets the checkpoint.
  2. Investigate the cause: cluster crash during commit, storage eventual consistency, or an orchestrator killing the pipeline mid-flush.

expect_or_fail halted the pipeline

Hard expectation failed.

Fix:

  1. Query the pipeline's event log for the specific expectation:
    SELECT * FROM event_log(TABLE(prod.silver.my_pipeline))
    WHERE details:flow_progress:data_quality:expectations IS NOT NULL
    ORDER BY timestamp DESC LIMIT 20;
    
  2. Investigate the source data.
  3. Either fix the source, relax the expectation, or switch to expect_or_drop.
  4. Resume the pipeline; it picks up from the failed update.

Pipeline stuck in INITIALIZING

Cluster launch failure.

Fix:

  1. Cluster policy violation; check the pipeline's cluster config against the policy.
  2. Cloud quota exhausted; check EC2 limits.
  3. Switch to serverless LDP to sidestep cluster provisioning.

Asset Bundle errors

bundle validate reports "variable not defined"

Target references a variable not declared at bundle scope.

Fix:

# In databricks.yml
variables:
  catalog:
    description: UC catalog for this env
  warehouse_id:
    description: SQL warehouse for transforms

Then populate per-target.
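Per-target values then live under targets; a sketch with hypothetical warehouse IDs:

```yaml
# In databricks.yml (hypothetical IDs)
targets:
  dev:
    variables:
      catalog: dev
      warehouse_id: abc123
  prod:
    variables:
      catalog: prod
      warehouse_id: def456
```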

bundle deploy fails with PERMISSION_DENIED

The deploying principal lacks grants on the target workspace.

Fix:

-- Grant to the service principal
GRANT USE CATALOG ON CATALOG prod TO `<sp-name>`;
GRANT CREATE JOB ON WORKSPACE TO `<sp-name>`;
GRANT CREATE PIPELINE ON CATALOG prod TO `<sp-name>`;

Prod deploy overwrites a UI-only change

Expected and correct. Bundle state is the source of truth.

Fix: Replay the change in the bundle; deploy. Train the team that UI edits on bundle-managed resources are lost on the next deploy.

Connection / auth errors

Could not find profile named '<X>'

CLI profile mismatch or ~/.databrickscfg missing.

Fix:

databricks auth profiles                           # list configured
databricks auth login --host <workspace-url> \
  --profile <name>                                  # add or update
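The resulting ~/.databrickscfg looks roughly like this (hypothetical hosts and profile name):

```ini
[DEFAULT]
host = https://my-workspace.cloud.databricks.com

[prod]
host = https://prod-workspace.cloud.databricks.com
auth_type = databricks-cli
```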

Connection refused / Connection timeout

A network-level failure, not a credentials problem.

Fix:

  1. Verify the host URL is correct, no trailing slash, with https://.
  2. VPN or corporate network may be blocking egress.
  3. databricks auth describe for diagnostic info.

Token expired or Invalid token

PAT expired or revoked.

Fix:

  1. Re-authenticate: databricks auth login.
  2. For CI, switch to workload identity federation (OIDC) so you never manage PATs.

Lakebase errors

too many connections

Application opened more connections than the instance allows.

Fix:

  1. Add a connection pool; size maxconn to 50-80% of the instance's max.
  2. Investigate the app: per-request connections are the usual culprit.
  3. Resize the Lakebase instance if the pool is rightsized and still insufficient.
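The pool in step 1 can be any standard pooler (psycopg2.pool, pgbouncer); the core idea is a bounded, blocking pool. A minimal sketch, assuming any connect() factory:

```python
import queue

class ConnectionPool:
    """Minimal bounded pool sketch; `connect` is any factory
    (e.g. psycopg2.connect bound to the Lakebase instance)."""

    def __init__(self, connect, maxconn):
        self._q = queue.Queue(maxsize=maxconn)
        # Open all connections up front so the cap is hard.
        for _ in range(maxconn):
            self._q.put(connect())

    def acquire(self, timeout=5):
        # Blocks when the pool is exhausted instead of opening a new
        # connection -- this is what prevents "too many connections".
        return self._q.get(timeout=timeout)

    def release(self, conn):
        self._q.put(conn)
```

A real pooler also handles broken connections and per-request checkout/return; the point of the sketch is the hard upper bound.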

Sync table rows stale

The sync cadence is slower than the consumer expects.

Fix:

  1. Tighten sync_schedule.
  2. For near-real-time: switch to continuous sync mode.
  3. For cost-sensitive cases: add a cache layer in the app (Redis, in-process LRU) instead of polling faster.
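The cache in step 3 can be as small as a TTL wrapper in the app process; a sketch (the time-bucketing trick trades exact expiry for simplicity):

```python
import time
from functools import lru_cache

def ttl_cached(ttl_seconds, maxsize=1024):
    """Read-through cache whose entries expire roughly every ttl_seconds.
    Keys are bucketed by wall-clock time, so crossing a bucket boundary
    invalidates reads rather than tracking per-entry expiry."""
    def wrap(fn):
        cached = lru_cache(maxsize=maxsize)(lambda key, _bucket: fn(key))
        def inner(key):
            return cached(key, int(time.time() // ttl_seconds))
        return inner
    return wrap
```

Wrapping the sync-table read with ttl_cached(60) bounds staleness at sync lag plus one minute, without polling Lakebase faster.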

Quick diagnostic commands

# CLI debug level
databricks --debug <command>

# Auth sanity
databricks auth describe
databricks current-user me

# Bundle diff preview
databricks bundle validate --target prod
databricks bundle summary --target prod

# Last run's result for a job
databricks jobs list-runs --job-id <id> --limit 1 --output JSON | jq

See also