Scan the first column for your symptom. The fix column is the first thing to try, not the only thing.

Task execution errors

Negsignal.SIGKILL / exit code -9 / 137

Task log ends abruptly, no Python traceback. The process was killed by the kernel or Kubernetes.

Root cause: Out-of-memory kill. The task exceeded its memory limit.

Fix:

  1. Confirm OOM:
    kubectl describe pod <pod-name> -n airflow | grep -A2 OOMKilled
    
  2. Increase resources.limits.memory in the task's executor_config or worker deployment.
  3. Better: offload the work from the Airflow worker to a dedicated compute system. See the supervisor model.
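
For step 2, a minimal sketch of what the executor_config override can look like under the KubernetesExecutor. The container sizes are illustrative, and the kubernetes client models come from the CNCF Kubernetes provider's dependency:

```python
# Illustrative sketch, assuming KubernetesExecutor and the CNCF Kubernetes
# provider (which pulls in the kubernetes client) are installed.
from kubernetes.client import models as k8s

executor_config = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # Airflow's task container is named "base"
                    resources=k8s.V1ResourceRequirements(
                        requests={"memory": "1Gi"},
                        limits={"memory": "2Gi"},  # raise past the OOM threshold
                    ),
                )
            ]
        )
    )
}
# Then pass executor_config=executor_config to the task's operator.
```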

AirflowTaskTimeout

airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 1234

Root cause: Task exceeded execution_timeout.

Fix:

  1. Is the task genuinely slow? Optimize the query or logic.
  2. Is it stuck on a downstream resource (DB lock, unresponsive API)? Fix the downstream.
  3. Raise execution_timeout only if the new duration is a permanent expectation.
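
If step 3 applies, the timeout is a per-task setting; a sketch, where the 45-minute budget is illustrative:

```python
from datetime import timedelta

# Illustrative budget; raise it only when the longer runtime is permanent.
EXECUTION_TIMEOUT = timedelta(minutes=45)

# e.g. PythonOperator(task_id="transform",
#                     execution_timeout=EXECUTION_TIMEOUT, ...)
```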

AirflowSensorTimeout

airflow.exceptions.AirflowSensorTimeout

Root cause: Sensor did not detect its target condition within timeout.

Fix by sensor type:

Sensor              Check
S3KeySensor         S3 key actually arrived; upstream job completed; bucket and key correct
ExternalTaskSensor  Upstream DAG finished; execution_delta matches upstream cadence
HttpSensor          Target API responded; response_check predicate correct

Warning

Do not bump sensor timeouts as a first response. 4-hour sensor timeouts mean 4 hours of silence before anyone knows something is wrong. Fix the upstream; keep the timeout honest.
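
One way to keep sensor waits honest without burning worker slots is reschedule mode; a sketch with illustrative numbers:

```python
# Illustrative values; tune poke_interval and timeout to the upstream's cadence.
sensor_kwargs = {
    "mode": "reschedule",   # release the worker slot between pokes
    "poke_interval": 300,   # check every 5 minutes
    "timeout": 60 * 60,     # fail after 1 hour instead of 4 hours of silence
}
# e.g. S3KeySensor(task_id="wait_for_export", bucket_key="...", **sensor_kwargs)
```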

Max retries reached

Task instance has been set to failed, max retries reached

Root cause: All retries exhausted with the same underlying error.

Fix:

  1. Read logs for each try number; same error or different?
  2. If same every time, the failure is deterministic. Classify as logic / config / permission / data and fix root cause. Do not clear-and-retry.
  3. If different, investigate transient causes (intermittent network, rate-limit storms).
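
Step 1's same-or-different check can be mechanized; a hypothetical helper that compares the final error line from each try:

```python
def classify_failures(last_error_lines):
    """Hypothetical helper: the same error on every try means the failure
    is deterministic; varying errors suggest a transient cause."""
    unique = set(last_error_lines)
    return "deterministic" if len(unique) == 1 else "transient"
```

For example, three tries all ending in the same PermissionError classify as deterministic: fix the permission, do not clear-and-retry.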

AirflowFailException

airflow.exceptions.AirflowFailException: <message>

Intentional. The task author decided this condition was unrecoverable.

Fix: Read the message; fix what it is pointing at. Do not clear-and-retry without addressing the underlying issue.
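
The intent behind the exception is a guard like the following sketch. The validation itself is hypothetical, and a stand-in class is used here; in a real DAG the exception comes from airflow.exceptions:

```python
class AirflowFailException(Exception):
    """Stand-in for airflow.exceptions.AirflowFailException, which
    fails the task immediately and skips any remaining retries."""

def validate_export(row_count):
    # Hypothetical guard: retrying cannot fix an empty export.
    if row_count == 0:
        raise AirflowFailException("export produced 0 rows; check upstream extract")
    return row_count
```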

Scheduler errors

Scheduler heartbeat stale

$ curl http://$AIRFLOW_HOST/api/v1/health | jq
{"scheduler": {"status": "unhealthy", ...}}

Root cause: Scheduler is stuck, OOM, or missing.

Fix:

  1. Restart the scheduler.
  2. Check metadata DB health (airflow db check).
  3. Look for DAG-parse timeouts: airflow dags report — any DAG taking > 30 s?
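
The health payload above can be checked mechanically; a sketch that reads the scheduler status field shown in the example response:

```python
import json

def scheduler_healthy(payload: str) -> bool:
    """Return True when the /api/v1/health payload reports a healthy scheduler."""
    health = json.loads(payload)
    return health.get("scheduler", {}).get("status") == "healthy"
```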

DAGs missing from the UI

Root cause: Parse error in one or more DAG files.

Fix:

  1. astro dev run dags list-import-errors or airflow dags list-import-errors.
  2. Fix the parse error; wait for re-parse (~30 s).

DAG parse slow

Root cause: Top-level imports or API calls in DAG file.

Fix:

  1. Move heavy imports (pandas, boto3, etc.) into task callables.
  2. Move any API call / DB query out of DAG definition.
  3. Verify: airflow dags report should show sub-second parse per DAG.

Note

A DAG taking 10+ seconds to parse slows every scheduler cycle and blocks other DAGs from updating. DAG definitions should be pure Python with fast parse times. Heavy work lives in tasks, not at module scope.
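
A sketch of the move in step 1 of the fix, with json standing in for a heavy dependency like pandas:

```python
# Module scope stays cheap: nothing here but DAG and task definitions.

def transform(**context):
    # Heavy imports live inside the callable, paid at run time,
    # not on every scheduler parse cycle.
    import json  # stand-in for pandas, boto3, etc.
    return json.dumps({"ok": True})
```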

Resource exhaustion

All workers busy

Root cause: One of the concurrency limits is capping.

Fix: See concurrency reference. Work through:

  1. airflow config get-value core parallelism — cluster ceiling.
  2. DAG-level max_active_tasks, max_active_runs.
  3. Pools at 100% utilization.
  4. Synchronous sensors holding slots.
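
Steps 2 and 3 map to per-DAG settings; an illustrative sketch of the knobs (names and values are examples, not recommendations):

```python
# Illustrative per-DAG caps; pass these to the DAG constructor.
dag_kwargs = {
    "max_active_tasks": 8,  # concurrent task instances across this DAG's runs
    "max_active_runs": 2,   # concurrent runs of this DAG
}
# e.g. with DAG("nightly_load", **dag_kwargs, ...): ...
```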

Pool full

Airflow task pool 'salesforce_api' is at capacity (5/5)

Root cause: Too many concurrent tasks want the resource.

Fix:

  1. If the pool size matches the real rate limit, this is working as intended: tasks queue until a slot frees.
  2. If tasks run longer than expected, investigate per-task duration.
  3. If the rate limit was raised (e.g., Salesforce plan upgrade), update the pool size.
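
For step 3, the pool size is updated with the CLI; the new size here is illustrative:

```shell
# Resize the pool only after a confirmed rate-limit increase.
airflow pools set salesforce_api 10 "Salesforce API concurrency limit"
```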

Triggerer missing

Root cause: Deferrable tasks require a Triggerer process; if none is deployed, deferred tasks never resume and sit in the deferred state without raising an error.

Fix:

  1. Check Triggerer pods: kubectl get pods -l component=triggerer.
  2. On Astronomer: enable the Triggerer on the deployment.
  3. On OSS: run airflow triggerer as a separate process.

Connection / auth errors

ConnectionId not found

Root cause: Referenced connection ID is not in the metadata DB or configured backend.

Fix:

airflow connections list | grep <id>        # verify presence
airflow connections add <id> --conn-type …   # add if missing

401 / 403 from a provider

Root cause: Credentials expired, rotated, or lacking scope.

Fix:

  1. Verify the secret backend is reachable (AWS Secrets Manager, Vault).
  2. Rotate the credential if it is aged.
  3. Check the underlying service (Databricks, Salesforce) for recent permission changes.

aws_default authorization errors on S3

Root cause: IRSA role missing a policy, or task not inheriting the right identity.

Fix:

aws sts get-caller-identity     # run inside the Airflow worker
aws s3 ls s3://<bucket>/        # confirm the role has access

If identity is wrong, the worker is using the default node role instead of a task-scoped role. Fix the IRSA setup.

Databricks integration errors

DatabricksRunNowOperator hangs forever

Root cause: wait_for_termination=True (default) polls synchronously, holding a worker slot.

Fix: Switch to deferrable:

run = DatabricksRunNowOperator(
    ...,
    deferrable=True,
)

Databricks job finished but Airflow still waiting

Root cause: Triggerer missing or stuck.

Fix: Verify Triggerer pods running; restart if stale.

Run failed with error 'X' but the job succeeded in Databricks UI

Root cause: The Databricks SDK or CLI version bundled with the provider does not match the job definition format, so Airflow misreads the run outcome.

Fix: Pin the apache-airflow-providers-databricks version and the Databricks SDK version together; upgrade in lockstep.

Backfill errors

Backfill dumps a flood of runs

Root cause: catchup=True with an old start_date; unpausing the DAG schedules a run for every missed interval.

Fix: Always catchup=False. Use airflow dags backfill for explicit backfills.
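
A sketch of the safe default plus an explicit backfill (DAG id and dates are illustrative):

```python
# Safe default: never schedule missed intervals implicitly.
dag_kwargs = {"catchup": False}

# Backfills are then explicit and deliberate, e.g.:
#   airflow dags backfill my_dag --start-date 2024-01-01 --end-date 2024-01-31
```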

Overlap corrupting state

Root cause: max_active_runs > 1 with overlapping stateful writes.

Fix: Set max_active_runs=1 on the DAG. Re-run the backfill sequentially.

Danger

Never backfill a stateful DAG with max_active_runs > 1. Two runs writing to the same partition simultaneously corrupt the partition; the corruption is silent and hard to reverse. Set the limit, then backfill.

Deployment errors

astro deploy fails with image build error

Root cause: Package mismatch, outdated base image, or bad Dockerfile.

Fix:

  1. Build locally first: astro dev start succeeds?
  2. Check requirements.txt for conflicting pins.
  3. Check packages.txt for apt packages not available in the base image.

DAGs deploy but do not appear in the UI

Root cause: Parse error (most common), or dag folder not refreshing.

Fix:

  1. astro dev run dags list-import-errors or its production equivalent via the UI.
  2. Fix parse errors; wait for next parse cycle.
  3. Check [scheduler] dag_dir_list_interval; if set too high, new DAGs take a while to appear.

Astronomer-specific

Deployment UI shows stale metrics

Root cause: Metrics pipeline delay, not a real problem.

Fix: Refresh; give it a minute. If persistent, check Astronomer status page.

Cannot scale workers up

Root cause: Deployment quota cap.

Fix: Check the deployment's resource cap in the Astro UI; request an increase if needed.

Quick diagnostic commands

# Everything
airflow info                                     # Versions, paths, providers

# Task-specific
airflow tasks states-for-dag-run <dag_id> <run_id>
airflow tasks test <dag_id> <task_id> <date>     # re-run locally, logs to stdout

# DAG-specific
airflow dags state <dag_id> <date>
airflow dags list-runs --dag-id <dag_id> --state failed

# Parse errors
airflow dags list-import-errors

# Which DAGs are slow to parse
airflow dags report

# Metadata DB health
airflow db check

# Pool utilization
airflow pools list

# Scheduler health
curl $AIRFLOW_HOST/api/v1/health

See also