Scan the error names below for your symptom. Each fix list is the first thing to try, not the only thing.
Cluster startup errors
CLOUD_PROVIDER_LAUNCH_FAILURE
The cloud could not provision the requested instance type.
Fix:
- Check EC2 / Azure / GCP vCPU limits in your region.
- Try a different instance family (e.g., `m5.2xlarge` → `r5.2xlarge`).
- Set `zone_id: auto` so Databricks picks an AZ with capacity.
- Consider `SPOT_WITH_FALLBACK` so capacity shortages fall back to on-demand.
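The bullets above can be combined in one cluster spec. A sketch of the relevant AWS attributes (values illustrative; the keys are standard Databricks cluster-spec fields):

```yaml
# Fragment of a cluster definition (AWS)
node_type_id: r5.2xlarge          # alternative family if m5 capacity is tight
aws_attributes:
  zone_id: auto                   # let Databricks pick an AZ with capacity
  availability: SPOT_WITH_FALLBACK  # spot first, fall back to on-demand
  first_on_demand: 1              # keep the driver on-demand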
DRIVER_UNREACHABLE
Databricks cannot reach the driver node over the network.
Fix:
- Check security groups / NSGs allow the expected Databricks CIDR ranges.
- Verify that VPC peering and routing tables are intact.
- Re-create the cluster; occasional transient DNS issues resolve on retry.
INIT_SCRIPT_FAILURE
An init script returned non-zero.
Fix:
- UI: Compute → cluster → Logs → `init_scripts/<timestamp>-<node>/`.
- Read the captured stdout/stderr.
- Reproduce locally; init scripts are plain bash.
- Common causes: a missing pip package version, an apt lock held by another script, or blocked network egress.
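Since init scripts are plain bash, a defensive skeleton helps with two of the common causes above (apt lock contention, flaky egress). A sketch; the commented package names are placeholders, not from this document:

```shell
#!/bin/bash
# Defensive init-script skeleton (sketch).
set -euo pipefail             # fail fast so the real error reaches stderr logs

retry() {                     # retry flaky steps a few times
  local n=0
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge 3 ] && return 1
    sleep 1                   # short fixed backoff; tune for real clusters
  done
}

# Hypothetical usage (package names are placeholders):
# retry apt-get -y install libsnappy-dev
# retry pip install "mypkg==1.2.3"   # always pin versions

retry echo "init steps completed"
```

Pinning versions matters: an unpinned `pip install` that worked last month can pull a new release and fail the whole cluster today.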
CLOUD_PROVIDER_SHUTDOWN
The cloud reclaimed the instance (spot preemption or maintenance).
Fix:
- Retry.
- If persistent during a critical window, switch the driver to on-demand (`first_on_demand: 1`).
INTERNAL_ERROR
Databricks platform issue.
Fix:
- Retry.
- Check the Databricks status page for regional incidents.
- Contact support with the cluster ID and time if persistent.
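When the fix for transient errors is "retry", automate it with jittered exponential backoff rather than hammering the API. A generic sketch; `TransientError` is a stand-in for whatever SDK or HTTP error corresponds to `INTERNAL_ERROR` in your client:

```python
import random
import time

class TransientError(Exception):
    """Placeholder for a retryable platform failure (e.g., a 5xx response)."""

def call_with_retry(fn, attempts=4, base=0.5):
    """Call fn, retrying transient failures with jittered exponential backoff."""
    for i in range(attempts):
        try:
            return fn()
        except TransientError:
            if i == attempts - 1:
                raise                                   # exhausted; surface it
            time.sleep(base * 2 ** i + random.uniform(0, base))  # jitter
```

Jitter matters when many jobs retry at once: without it they re-collide on the same schedule.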
Unity Catalog permission errors
User does not have USE CATALOG on catalog 'prod'
Missing catalog-level grant.
Fix:
GRANT USE CATALOG ON CATALOG prod TO `<principal>`;
User does not have USE SCHEMA on schema
Missing schema-level grant. Required even if the user has a table-level grant.
Fix:
GRANT USE SCHEMA ON SCHEMA prod.silver TO `<principal>`;
User does not have SELECT on table
Missing table-level read grant.
Fix:
GRANT SELECT ON TABLE prod.silver.customers TO `<principal>`;
User does not have CREATE TABLE
Missing schema-level create grant.
Fix:
GRANT CREATE TABLE ON SCHEMA prod.silver TO `<principal>`;
Only the owner can grant permissions
Non-owner principal trying to GRANT.
Fix: Transfer ownership, or have an owner execute the grant:
ALTER TABLE prod.gold.revenue_summary OWNER TO `data-engineering`;
Delta / Lakehouse errors
DELTA_MISSING_COLUMN
A query references a column that no longer exists in the Delta table.
Fix:
- Check the compiled SQL (or the `ref()` chain if using dbt).
- If upstream was rewritten, update consumers to the new column set.
- Run `--full-refresh` for incremental models whose schema changed.
MERGE_CARDINALITY_VIOLATION
MERGE INTO found multiple source rows matching a single target row.
Fix: Deduplicate on the merge key before the merge. In dbt incremental:
with ranked as (
    select
        *,
        row_number() over (
            partition by order_id
            order by _loaded_at desc
        ) as rn
    from {{ ref('stg_orders') }}
)
select * except(rn) from ranked where rn = 1
SCHEMA_CHANGE_NOT_ALLOWED
The write's schema differs from the target table's schema.
Fix:
- For incremental dbt: set `on_schema_change: 'sync_all_columns'` or run `--full-refresh`.
- For raw Spark: enable `spark.databricks.delta.schema.autoMerge.enabled`.
- For LDP: let type widening handle it, or full-refresh the affected table.
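The dbt option above lives in the model config. A sketch (the model name is hypothetical; `on_schema_change` is a standard dbt incremental config):

```yaml
# models/schema.yml fragment
models:
  - name: fct_orders            # hypothetical incremental model
    config:
      materialized: incremental
      on_schema_change: sync_all_columns  # add/remove columns to match the new schema
```

Note `sync_all_columns` handles added and removed columns, not type changes; those still need a `--full-refresh`.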
STATEMENT_TIMEOUT
SQL query exceeded the warehouse timeout.
Fix:
- Read the query profile; find the slow operator.
- Add partition filters or Z-order keys.
- Break the query into intermediate steps.
- Last resort: bump warehouse size or client timeout.
Warning
Bumping warehouse size is an anti-fix. It hides slow SQL that will cost more next month as data grows. Always look at the query plan before changing the warehouse.
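The "partition filters or Z-order" fix above, sketched against a hypothetical table (names and columns are illustrative; `OPTIMIZE ... ZORDER BY` is standard Databricks SQL):

```sql
-- Prune partitions first: scan 7 days, not the whole table
SELECT order_id, amount
FROM prod.silver.orders
WHERE order_date >= current_date() - INTERVAL 7 DAYS;

-- Co-locate data on the hot filter key so data skipping works
OPTIMIZE prod.silver.orders ZORDER BY (customer_id);
```

Check the query profile after each change; if the slow operator is still scanning most files, the filter is not reaching the scan.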
WAREHOUSE_NOT_RUNNING
The SQL warehouse is stopped.
Fix:
- Start it: UI or `databricks sql warehouses start <id>`.
- Check `auto_stop_mins`; it may be more aggressive than the workload's cadence.
Lakeflow Declarative Pipeline errors
Schema evolution conflict mid-run
LDP detects a column type or removal the current pipeline cannot evolve.
Fix:
- For additive changes: enable `spark.databricks.delta.schema.autoMerge.enabled` in the pipeline config.
- For type changes or removals: full-refresh the affected table.
- For breaking changes: version the pipeline; consumers migrate deliberately.
Checkpoint corruption
StreamingQueryException: Error reading checkpoint
InvalidOffsetException: ...
Fix:
- Full-refresh the affected streaming table; this resets the checkpoint.
- Investigate the cause: cluster crash during commit, storage eventual consistency, or an orchestrator killing the pipeline mid-flush.
expect_or_fail halted the pipeline
Hard expectation failed.
Fix:
- Query the pipeline's event log for the specific expectation: `SELECT * FROM event_log(TABLE(prod.silver.my_pipeline)) WHERE details:flow_progress:data_quality:expectations IS NOT NULL ORDER BY timestamp DESC LIMIT 20;`
- Investigate the source data.
- Either fix the source, relax the expectation, or switch to `expect_or_drop`.
- Resume the pipeline; it picks up from the failed update.
Pipeline stuck in INITIALIZING
Cluster launch failure.
Fix:
- Cluster policy violation; check the pipeline's cluster config against the policy.
- Cloud quota exhausted; check EC2 limits.
- Switch to serverless LDP to sidestep cluster provisioning.
Asset Bundle errors
bundle validate reports "variable not defined"
Target references a variable not declared at bundle scope.
Fix:
# In databricks.yml
variables:
  catalog:
    description: UC catalog for this env
  warehouse_id:
    description: SQL warehouse for transforms
Then populate per-target.
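A sketch of the per-target population (target names and warehouse IDs are placeholders):

```yaml
# Later in databricks.yml
targets:
  dev:
    variables:
      catalog: dev
      warehouse_id: "<dev-warehouse-id>"
  prod:
    variables:
      catalog: prod
      warehouse_id: "<prod-warehouse-id>"
```

`bundle validate` fails only when a target references a variable with no bundle-scope declaration, so declare once at the top and override per-target.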
bundle deploy fails with PERMISSION_DENIED
The deploying principal lacks grants on the target workspace.
Fix:
-- Grant to the service principal
GRANT USE CATALOG ON CATALOG prod TO `<sp-name>`;
GRANT CREATE JOB ON WORKSPACE TO `<sp-name>`;
GRANT CREATE PIPELINE ON CATALOG prod TO `<sp-name>`;
Prod deploy overwrites a UI-only change
Expected and correct. Bundle state is the source of truth.
Fix: Replay the change in the bundle; deploy. Train the team that UI edits on bundle-managed resources are lost on the next deploy.
Connection / auth errors
Could not find profile named '<X>'
CLI profile mismatch or ~/.databrickscfg missing.
Fix:
databricks auth profiles # list configured
databricks auth login --host <workspace-url> \
--profile <name> # add or update
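`databricks auth login` writes the profile for you, but it helps to know what a healthy `~/.databrickscfg` looks like (host is a placeholder):

```ini
; ~/.databrickscfg sketch
[DEFAULT]
host      = https://my-workspace.cloud.databricks.com
auth_type = databricks-cli

[prod]
host      = https://prod-workspace.cloud.databricks.com
auth_type = databricks-cli
```

"Could not find profile named '<X>'" means the bracketed section `[<X>]` is missing from this file, or the file itself is absent.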
Connection refused / Connection timeout
A network-level failure reaching the workspace, not an auth problem.
Fix:
- Verify the host URL is correct, no trailing slash, with https://.
- VPN or corporate network may be blocking egress.
- Run `databricks auth describe` for diagnostic info.
Token expired or Invalid token
PAT expired or revoked.
Fix:
- Re-authenticate: `databricks auth login`.
- For CI, switch to workload identity federation (OIDC) so you never manage PATs.
Lakebase errors
too many connections
Application opened more connections than the instance allows.
Fix:
- Add a connection pool; size `maxconn` to 50–80% of the instance's max.
- Investigate the app: per-request connections are the usual culprit.
- Resize the Lakebase instance if the pool is rightsized and still insufficient.
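The pooling idea, stripped to its essence in pure Python (illustrative; a real app would use `psycopg_pool` or SQLAlchemy's pool against the Lakebase endpoint, and the fake factory below stands in for a database connect call):

```python
import queue

class Pool:
    """Bounded connection pool: never opens more than maxconn connections."""

    def __init__(self, factory, maxconn):
        self._free = queue.Queue(maxsize=maxconn)
        for _ in range(maxconn):
            self._free.put(factory())   # pre-open maxconn connections

    def acquire(self, timeout=5.0):
        # Blocks instead of opening connection #(maxconn+1); this is what
        # prevents "too many connections" under load.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        self._free.put(conn)

# Usage with a fake connection factory (placeholder for a real DB connect)
pool = Pool(factory=lambda: object(), maxconn=4)
conn = pool.acquire()
pool.release(conn)
```

The per-request anti-pattern is the opposite: a new connect per HTTP request, so concurrency N means N connections, and the instance limit is just a matter of traffic.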
Sync table rows stale
Sync cadence slower than consumer expects.
Fix:
- Tighten `sync_schedule`.
- For near-real-time needs: switch to continuous sync mode.
- For cost-sensitive cases: add a cache layer in the app (Redis, in-process LRU) instead of polling faster.
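For the cost-sensitive case, an in-process TTL cache is often enough to stop the app from re-reading a sync table on every request. A minimal sketch (not a library API; roll-your-own decorator over a dict):

```python
import functools
import time

def ttl_cache(ttl):
    """Cache a function's results for ttl seconds, keyed by positional args."""
    def deco(fn):
        cache = {}
        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and now - hit[1] < ttl:
                return hit[0]           # fresh enough: skip the DB read
            val = fn(*args)
            cache[args] = (val, now)
            return val
        return wrapper
    return deco

@ttl_cache(ttl=30)                      # tolerate 30s staleness in-process
def load_customer(customer_id):
    ...                                 # placeholder for the Lakebase query
```

Pick the TTL from what the consumer actually tolerates; if that number is smaller than the sync cadence, caching cannot help and continuous sync is the right fix.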
Quick diagnostic commands
# CLI debug level
databricks --debug <command>
# Auth sanity
databricks auth describe
databricks current-user me
# Bundle diff preview
databricks bundle validate --target prod
databricks bundle summary --target prod
# Last run's result for a job
databricks jobs list-runs --job-id <id> --limit 1 --output JSON | jq
See also
- Cluster troubleshooting guide — the 5-minute procedure.
- Unity Catalog — the permission model that causes most grant errors.
- Asset Bundles — the deploy workflow.