Scan the error names below for your symptom. Each fix list is the first thing to try, not the only thing.
Cluster startup errors
CLOUD_PROVIDER_LAUNCH_FAILURE
The cloud could not provision the requested instance type.
Fix:
- Check EC2 / Azure / GCP vCPU limits in your region.
- Try a different instance family (e.g., `m5.2xlarge` → `r5.2xlarge`).
- Set `zone_id: auto` so Databricks picks an AZ with capacity.
- Consider `SPOT_WITH_FALLBACK` so capacity shortages fall back to on-demand.
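The bullets above can be combined in one cluster spec. A sketch of the relevant AWS attributes (values illustrative; the keys are standard Databricks cluster-spec fields):

```yaml
# Fragment of a cluster definition (AWS)
node_type_id: r5.2xlarge          # alternative family if m5 capacity is tight
aws_attributes:
  zone_id: auto                   # let Databricks pick an AZ with capacity
  availability: SPOT_WITH_FALLBACK  # spot first, fall back to on-demand
  first_on_demand: 1              # keep the driver on-demand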
DRIVER_UNREACHABLE
Databricks cannot reach the driver node over the network.
Fix:
- Check security groups / NSGs allow the expected Databricks CIDR ranges.
- Verify that VPC peering and routing tables are intact.
- Re-create the cluster; occasional transient DNS issues resolve on retry.
INIT_SCRIPT_FAILURE
An init script returned non-zero.
Fix:
- UI: Compute → cluster → Logs → `init_scripts/<timestamp>-<node>/`.
- Read the captured stdout/stderr.
- Reproduce locally; init scripts are plain bash.
- Common causes: a missing pip package version, an apt lock held by another script, or blocked network egress.
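Since init scripts are plain bash, a defensive skeleton helps with two of the common causes above (apt lock contention, flaky egress). A sketch; the commented package names are placeholders, not from this document:

```shell
#!/bin/bash
# Defensive init-script skeleton (sketch).
set -euo pipefail             # fail fast so the real error reaches stderr logs

retry() {                     # retry flaky steps a few times
  local n=0
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge 3 ] && return 1
    sleep 1                   # short fixed backoff; tune for real clusters
  done
}

# Hypothetical usage (package names are placeholders):
# retry apt-get -y install libsnappy-dev
# retry pip install "mypkg==1.2.3"   # always pin versions

retry echo "init steps completed"
```

Pinning versions matters: an unpinned `pip install` that worked last month can pull a new release and fail the whole cluster today.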
CLOUD_PROVIDER_SHUTDOWN
The cloud reclaimed the instance (spot preemption or maintenance).
Fix:
- Retry.
- If persistent during a critical window, switch the driver to on-demand (`first_on_demand: 1`).
INTERNAL_ERROR
Databricks platform issue.
Fix:
- Retry.
- Check the Databricks status page for regional incidents.
- Contact support with the cluster ID and time if persistent.
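When the fix for transient errors is "retry", automate it with jittered exponential backoff rather than hammering the API. A generic sketch; `TransientError` is a stand-in for whatever SDK or HTTP error corresponds to `INTERNAL_ERROR` in your client:

```python
import random
import time

class TransientError(Exception):
    """Placeholder for a retryable platform failure (e.g., a 5xx response)."""

def call_with_retry(fn, attempts=4, base=0.5):
    """Call fn, retrying transient failures with jittered exponential backoff."""
    for i in range(attempts):
        try:
            return fn()
        except TransientError:
            if i == attempts - 1:
                raise                                   # exhausted; surface it
            time.sleep(base * 2 ** i + random.uniform(0, base))  # jitter
```

Jitter matters when many jobs retry at once: without it they re-collide on the same schedule.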
Unity Catalog permission errors
User does not have USE CATALOG on catalog 'prod'
Missing catalog-level grant.
Fix:
GRANT USE CATALOG ON CATALOG prod TO `<principal>`;
User does not have USE SCHEMA on schema
Missing schema-level grant. Required even if the user has a table-level grant.
Fix:
GRANT USE SCHEMA ON SCHEMA prod.silver TO `<principal>`;
User does not have SELECT on table
Missing table-level read grant.
Fix:
GRANT SELECT ON TABLE prod.silver.customers TO `<principal>`;
User does not have CREATE TABLE
Missing schema-level create grant.
Fix:
GRANT CREATE TABLE ON SCHEMA prod.silver TO `<principal>`;
Only the owner can grant permissions
Non-owner principal trying to GRANT.
Fix: Transfer ownership, or have an owner execute the grant:
ALTER TABLE prod.gold.revenue_summary OWNER TO `data-engineering`;
Delta / Lakehouse errors
DELTA_MISSING_COLUMN
A query references a column that no longer exists in the Delta table.
Fix:
- Check the compiled SQL (or the `ref()` chain if using dbt).
- If upstream was rewritten, update consumers to the new column set.
- Run `--full-refresh` for incremental models whose schema changed.
MERGE_CARDINALITY_VIOLATION
MERGE INTO found multiple source rows matching a single target row.
Fix: Deduplicate on the merge key before the merge. In dbt incremental:
with ranked as (
    select
        *,
        row_number() over (
            partition by order_id
            order by _loaded_at desc
        ) as rn
    from {{ ref('stg_orders') }}
)
select * except(rn) from ranked where rn = 1
SCHEMA_CHANGE_NOT_ALLOWED
The write's schema differs from the target table's schema.
Fix:
- For incremental dbt: set `on_schema_change: 'sync_all_columns'` or run `--full-refresh`.
- For raw Spark: enable `spark.databricks.delta.schema.autoMerge.enabled`.
- For LDP: let type widening handle it, or full-refresh the affected table.
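The dbt option above lives in the model config. A sketch (the model name is hypothetical; `on_schema_change` is a standard dbt incremental config):

```yaml
# models/schema.yml fragment
models:
  - name: fct_orders            # hypothetical incremental model
    config:
      materialized: incremental
      on_schema_change: sync_all_columns  # add/remove columns to match the new schema
```

Note `sync_all_columns` handles added and removed columns, not type changes; those still need a `--full-refresh`.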
STATEMENT_TIMEOUT
SQL query exceeded the warehouse timeout.
Fix:
- Read the query profile; find the slow operator.
- Add partition filters or Z-order keys.
- Break the query into intermediate steps.
- Last resort: bump warehouse size or client timeout.
Warning
Bumping warehouse size is an anti-fix. It hides slow SQL that will cost more next month as data grows. Always look at the query plan before changing the warehouse.
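The "partition filters or Z-order" fix above, sketched against a hypothetical table (names and columns are illustrative; `OPTIMIZE ... ZORDER BY` is standard Databricks SQL):

```sql
-- Prune partitions first: scan 7 days, not the whole table
SELECT order_id, amount
FROM prod.silver.orders
WHERE order_date >= current_date() - INTERVAL 7 DAYS;

-- Co-locate data on the hot filter key so data skipping works
OPTIMIZE prod.silver.orders ZORDER BY (customer_id);
```

Check the query profile after each change; if the slow operator is still scanning most files, the filter is not reaching the scan.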
WAREHOUSE_NOT_RUNNING
The SQL warehouse is stopped.
Fix:
- Start it: UI or `databricks sql warehouses start <id>`.
- Check `auto_stop_mins`; it may be more aggressive than the workload's cadence.
Lakeflow Declarative Pipeline errors
Schema evolution conflict mid-run
LDP detects a column type or removal the current pipeline cannot evolve.
Fix:
- For additive changes: enable `spark.databricks.delta.schema.autoMerge.enabled` in the pipeline config.
- For type changes or removals: full-refresh the affected table.
- For breaking changes: version the pipeline; consumers migrate deliberately.
Checkpoint corruption
StreamingQueryException: Error reading checkpoint
InvalidOffsetException: ...
Fix:
- Full-refresh the affected streaming table; this resets the checkpoint.
- Investigate the cause: cluster crash during commit, storage eventual consistency, or an orchestrator killing the pipeline mid-flush.
expect_or_fail halted the pipeline
Hard expectation failed.
Fix:
- Query the pipeline's event log for the specific expectation: `SELECT * FROM event_log(TABLE(prod.silver.my_pipeline)) WHERE details:flow_progress:data_quality:expectations IS NOT NULL ORDER BY timestamp DESC LIMIT 20;`
- Investigate the source data.
- Either fix the source, relax the expectation, or switch to `expect_or_drop`.
- Resume the pipeline; it picks up from the failed update.
Pipeline stuck in INITIALIZING
Cluster launch failure.
Fix:
- Cluster policy violation; check the pipeline's cluster config against the policy.
- Cloud quota exhausted; check EC2 limits.
- Switch to serverless LDP to sidestep cluster provisioning.
Asset Bundle errors
bundle validate reports "variable not defined"
Target references a variable not declared at bundle scope.
Fix:
# In databricks.yml
variables:
  catalog:
    description: UC catalog for this env
  warehouse_id:
    description: SQL warehouse for transforms
Then populate per-target.
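A sketch of the per-target population (target names and warehouse IDs are placeholders):

```yaml
# Later in databricks.yml
targets:
  dev:
    variables:
      catalog: dev
      warehouse_id: "<dev-warehouse-id>"
  prod:
    variables:
      catalog: prod
      warehouse_id: "<prod-warehouse-id>"
```

`bundle validate` fails only when a target references a variable with no bundle-scope declaration, so declare once at the top and override per-target.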
bundle deploy fails with PERMISSION_DENIED
The deploying principal lacks grants on the target workspace.
Fix:
-- Grant to the service principal
GRANT USE CATALOG ON CATALOG prod TO `<sp-name>`;
GRANT CREATE JOB ON WORKSPACE TO `<sp-name>`;
GRANT CREATE PIPELINE ON CATALOG prod TO `<sp-name>`;
Prod deploy overwrites a UI-only change
Expected and correct. Bundle state is the source of truth.
Fix: Replay the change in the bundle; deploy. Train the team that UI edits on bundle-managed resources are lost on the next deploy.
Connection / auth errors
Could not find profile named '<X>'
CLI profile mismatch or ~/.databrickscfg missing.
Fix:
databricks auth profiles # list configured
databricks auth login --host <workspace-url> \
--profile <name> # add or update
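`databricks auth login` writes the profile for you, but it helps to know what a healthy `~/.databrickscfg` looks like (host is a placeholder):

```ini
; ~/.databrickscfg sketch
[DEFAULT]
host      = https://my-workspace.cloud.databricks.com
auth_type = databricks-cli

[prod]
host      = https://prod-workspace.cloud.databricks.com
auth_type = databricks-cli
```

"Could not find profile named '<X>'" means the bracketed section `[<X>]` is missing from this file, or the file itself is absent.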
Connection refused / Connection timeout
A network-level failure reaching the workspace, not an auth problem.
Fix:
- Verify the host URL is correct, no trailing slash, with https://.
- VPN or corporate network may be blocking egress.
- Run `databricks auth describe` for diagnostic info.
Token expired or Invalid token
PAT expired or revoked.
Fix:
- Re-authenticate: `databricks auth login`.
- For CI, switch to workload identity federation (OIDC) so you never manage PATs.
Lakebase errors
too many connections
Application opened more connections than the instance allows.
Fix:
- Add a connection pool; size `maxconn` to 50–80% of the instance's max.
- Investigate the app: per-request connections are the usual culprit.
- Resize the Lakebase instance if the pool is rightsized and still insufficient.
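The pooling idea, stripped to its essence in pure Python (illustrative; a real app would use `psycopg_pool` or SQLAlchemy's pool against the Lakebase endpoint, and the fake factory below stands in for a database connect call):

```python
import queue

class Pool:
    """Bounded connection pool: never opens more than maxconn connections."""

    def __init__(self, factory, maxconn):
        self._free = queue.Queue(maxsize=maxconn)
        for _ in range(maxconn):
            self._free.put(factory())   # pre-open maxconn connections

    def acquire(self, timeout=5.0):
        # Blocks instead of opening connection #(maxconn+1); this is what
        # prevents "too many connections" under load.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        self._free.put(conn)

# Usage with a fake connection factory (placeholder for a real DB connect)
pool = Pool(factory=lambda: object(), maxconn=4)
conn = pool.acquire()
pool.release(conn)
```

The per-request anti-pattern is the opposite: a new connect per HTTP request, so concurrency N means N connections, and the instance limit is just a matter of traffic.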
Sync table rows stale
Sync cadence slower than consumer expects.
Fix:
- Tighten `sync_schedule`.
- For near-real-time needs: switch to continuous sync mode.
- For cost-sensitive cases: add a cache layer in the app (Redis, in-process LRU) instead of polling faster.
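For the cost-sensitive case, an in-process TTL cache is often enough to stop the app from re-reading a sync table on every request. A minimal sketch (not a library API; roll-your-own decorator over a dict):

```python
import functools
import time

def ttl_cache(ttl):
    """Cache a function's results for ttl seconds, keyed by positional args."""
    def deco(fn):
        cache = {}
        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and now - hit[1] < ttl:
                return hit[0]           # fresh enough: skip the DB read
            val = fn(*args)
            cache[args] = (val, now)
            return val
        return wrapper
    return deco

@ttl_cache(ttl=30)                      # tolerate 30s staleness in-process
def load_customer(customer_id):
    ...                                 # placeholder for the Lakebase query
```

Pick the TTL from what the consumer actually tolerates; if that number is smaller than the sync cadence, caching cannot help and continuous sync is the right fix.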
Quick diagnostic commands
# CLI debug level
databricks --debug <command>
# Auth sanity
databricks auth describe
databricks current-user me
# Bundle diff preview
databricks bundle validate --target prod
databricks bundle summary --target prod
# Last run's result for a job
databricks jobs list-runs --job-id <id> --limit 1 --output JSON | jq
See also
- Cluster troubleshooting guide — the 5-minute procedure.
- Unity Catalog — the permission model that causes most grant errors.
- Asset Bundles — the deploy workflow.