Scan the first column for your symptom. The fix column is the first thing to try, not the only thing.
## Task execution errors

### Negsignal.SIGKILL / exit code -9 / 137

Task log ends abruptly, no Python traceback. The process was killed by the kernel or Kubernetes.

Root cause: Out-of-memory kill. The task exceeded its memory limit.
Fix:
- Confirm OOM: `kubectl describe pod <pod-name> -n airflow | grep -A2 OOMKilled`
- Increase `resources.limits.memory` in the task's `executor_config` or worker deployment.
- Better: offload the work from the Airflow worker to a dedicated compute system. See the supervisor model.
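The 137 / -9 pairing can be decoded mechanically: shells and Kubernetes report death-by-signal as 128 plus the signal number, while Python's `subprocess` reports it as a negative signal number. A minimal sketch (the helper name is illustrative, not an Airflow API):

```python
import signal

def classify_exit(code: int) -> str:
    """Map a process exit code to a likely cause."""
    if code < 0:
        # Python's subprocess reports death-by-signal as -signum.
        return f"killed by {signal.Signals(-code).name}"
    if code > 128:
        # Shells and Kubernetes report 128 + signum.
        return f"killed by {signal.Signals(code - 128).name}"
    return "clean exit" if code == 0 else f"error exit {code}"
```

Both `classify_exit(137)` and `classify_exit(-9)` resolve to SIGKILL, the OOM killer's signature.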
### AirflowTaskTimeout

```
airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 1234
```

Root cause: Task exceeded `execution_timeout`.
Fix:
- Is the task genuinely slow? Optimize the query or logic.
- Is it stuck on a downstream resource (DB lock, unresponsive API)? Fix the downstream.
- Raise `execution_timeout` only if the new duration is a permanent expectation.
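If profiling confirms the longer duration is permanent, the timeout is raised in the task definition. A hedged sketch, assuming the TaskFlow API; the task name and durations are illustrative:

```python
from datetime import timedelta

from airflow.decorators import task

@task(
    # Raised after confirming the query now scans twice the data;
    # record the reason alongside the number, not just the number.
    execution_timeout=timedelta(minutes=60),
    retries=2,
)
def load_warehouse():
    ...
```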
### AirflowSensorTimeout

```
airflow.exceptions.AirflowSensorTimeout
```

Root cause: Sensor did not detect its target condition within its timeout.
Fix by sensor type:
| Sensor | Check |
|---|---|
| `S3KeySensor` | S3 key actually arrived; upstream job completed; bucket and key correct |
| `ExternalTaskSensor` | Upstream DAG finished; `execution_delta` matches upstream cadence |
| `HttpSensor` | Target API responded; `response_check` predicate correct |
> **Warning**
> Do not bump sensor timeouts as a first response. A 4-hour sensor timeout means 4 hours of silence before anyone knows something is wrong. Fix the upstream; keep the timeout honest.
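When the timeout genuinely is correct, at least stop the sensor from holding a worker slot while it waits. A sketch using `mode="reschedule"`; the bucket and key names are illustrative:

```python
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_export = S3KeySensor(
    task_id="wait_for_export",
    bucket_name="example-landing",      # illustrative
    bucket_key="exports/{{ ds }}.csv",  # illustrative
    poke_interval=300,       # check every 5 minutes
    timeout=60 * 60,         # fail loudly after 1 hour, not 4
    mode="reschedule",       # release the worker slot between pokes
)
```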
### Max retries reached

```
Task instance has been set to failed, max retries reached
```

Root cause: All retries exhausted with the same underlying error.
Fix:
- Read logs for each try number; same error or different?
- If same every time, the failure is deterministic. Classify as logic / config / permission / data and fix root cause. Do not clear-and-retry.
- If different, investigate transient causes (intermittent network, rate-limit storms).
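The triage in steps 1–3 can be mechanized: collect the final error line from each try and compare. A minimal sketch (the helper name is illustrative):

```python
def classify_failure(last_errors: list[str]) -> str:
    """Given the final error line from each retry, guess the failure class."""
    if len(set(last_errors)) == 1:
        # Identical error on every try: deterministic, fix the root cause.
        return "deterministic"
    # Differing errors across tries: suspect transient causes.
    return "transient"
```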
### AirflowFailException

```
airflow.exceptions.AirflowFailException: <message>
```

Intentional. The task author decided this condition was unrecoverable.
Fix: Read the message; fix what it is pointing at. Do not clear-and-retry without addressing the underlying issue.
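For reference, this is what the authoring side of an intentional hard-fail looks like; the validation logic is illustrative:

```python
from airflow.decorators import task
from airflow.exceptions import AirflowFailException

@task(retries=3)  # retries are skipped when AirflowFailException is raised
def validate_extract(row_count: int):
    if row_count == 0:
        # Unrecoverable: retrying the load cannot conjure missing rows.
        raise AirflowFailException(
            "source extract returned 0 rows; refusing to load downstream"
        )
```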
## Scheduler errors

### Scheduler heartbeat stale

```
$ curl http://$AIRFLOW_HOST/api/v1/health | jq
{"scheduler": {"status": "unhealthy", ...}}
```
Root cause: Scheduler is stuck, OOM, or missing.
Fix:
- Restart the scheduler.
- Check metadata DB health: `airflow db check`.
- Look for DAG-parse timeouts: `airflow dags report` — any DAG taking > 30 s?
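The health endpoint returns JSON, so a monitor can assert on it directly instead of eyeballing curl output. A minimal sketch (the function name is illustrative):

```python
import json

def scheduler_healthy(payload: str) -> bool:
    """Parse the /api/v1/health response body and check the scheduler component."""
    health = json.loads(payload)
    return health.get("scheduler", {}).get("status") == "healthy"
```

Feed it the body from `curl $AIRFLOW_HOST/api/v1/health`.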
### DAGs missing from the UI

Root cause: Parse error in one or more DAG files.
Fix:
- Run `astro dev run dags list-import-errors` or `airflow dags list-import-errors`.
- Fix the parse error; wait for re-parse (~30 s).
### DAG parse slow
Root cause: Top-level imports or API calls in DAG file.
Fix:
- Move heavy imports (pandas, boto3, etc.) into task callables.
- Move any API call / DB query out of DAG definition.
- Verify: `airflow dags report` should show sub-second parse time per DAG.
> **Note**
> A DAG taking 10+ seconds to parse slows every scheduler cycle and blocks other DAGs from updating. DAG definitions should be pure Python with fast parse times. Heavy work lives in tasks, not at module scope.
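The before/after pattern: imports inside the callable run once per task execution instead of on every scheduler parse cycle. A sketch, assuming the TaskFlow API; the task body is illustrative:

```python
from airflow.decorators import task

# Bad: `import pandas as pd` at module scope runs on every parse cycle.
# Good: defer it into the callable, as below.

@task
def transform(path: str) -> int:
    import pandas as pd  # paid at run time, not parse time

    df = pd.read_parquet(path)
    return len(df)
```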
## Resource exhaustion

### All workers busy
Root cause: One of the concurrency limits is capping.
Fix: See the concurrency reference. Work through:
- `airflow config get-value core parallelism` — cluster ceiling.
- DAG-level `max_active_tasks`, `max_active_runs`.
- Pools at 100% utilization.
- Synchronous sensors holding slots.
### Pool full

```
Airflow task pool 'salesforce_api' is at capacity (5/5)
```
Root cause: Too many concurrent tasks want the resource.
Fix:
- If the pool size matches the real rate limit, this is correct behavior: just wait; tasks queue.
- If tasks run longer than expected, investigate per-task duration.
- If the rate limit was raised (e.g., Salesforce plan upgrade), update the pool size.
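"Just wait" can be quantified: a full pool drains its queue in waves of pool-size. A back-of-envelope helper (the name is illustrative):

```python
import math

def pool_drain_minutes(queued_tasks: int, pool_slots: int, task_minutes: float) -> float:
    """Rough lower bound on time for a full pool to drain its queue."""
    # At most `pool_slots` tasks run at a time, so the queue drains in waves.
    return math.ceil(queued_tasks / pool_slots) * task_minutes
```

For example, 12 queued tasks against the 5-slot `salesforce_api` pool at ~10 minutes each is at least 30 minutes of queueing — normal, not a failure.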
### Triggerer missing
Root cause: Deferrable tasks require a Triggerer process; if none is deployed, async sensors fail silently.
Fix:
- Check Triggerer pods: `kubectl get pods -l component=triggerer`.
- On Astronomer: enable the Triggerer on the deployment.
- On OSS: run `airflow triggerer` as a separate process.
## Connection / auth errors

### ConnectionId not found
Root cause: Referenced connection ID is not in the metadata DB or configured backend.
Fix:
```
airflow connections list | grep <id>           # verify presence
airflow connections add <id> --conn-type …     # add if missing
```
### 401 / 403 from a provider
Root cause: Credentials expired, rotated, or lacking scope.
Fix:
- Verify the secret backend is reachable (AWS Secrets Manager, Vault).
- Rotate the credential if it is aged.
- Check the underlying service (Databricks, Salesforce) for recent permission changes.
### `aws_default` authorization errors on S3
Root cause: IRSA role missing a policy, or task not inheriting the right identity.
Fix:
```
aws sts get-caller-identity    # run inside the Airflow worker
aws s3 ls s3://<bucket>/       # confirm the role has access
```

If the identity is wrong, the worker is using the default node role instead of a task-scoped role. Fix the IRSA setup.
## Databricks integration errors

### DatabricksRunNowOperator hangs forever

Root cause: `wait_for_termination=True` (the default) polls synchronously, holding a worker slot.
Fix: Switch to deferrable:

```python
run = DatabricksRunNowOperator(
    ...,
    deferrable=True,
)
```
### Databricks job finished but Airflow still waiting
Root cause: Triggerer missing or stuck.
Fix: Verify Triggerer pods running; restart if stale.
### Run failed with error 'X' but the job succeeded in Databricks UI

Root cause: The Databricks CLI or SDK version in the provider mismatches the job definition format.

Fix: Pin the `apache-airflow-providers-databricks` version and the Databricks SDK version together; upgrade them in lockstep.
## Backfill errors

### Backfill dumps a flood of runs

Root cause: `catchup=True` on a paused DAG; unpausing schedules every missed run at once.

Fix: Always set `catchup=False`. Use `airflow dags backfill` for explicit backfills.
### Overlap corrupting state

Root cause: `max_active_runs > 1` with overlapping stateful writes.

Fix: Set `max_active_runs=1` on the DAG. Re-run the backfill sequentially.
> **Danger**
> Never backfill a stateful DAG with `max_active_runs > 1`. Two runs writing to the same partition simultaneously corrupt the partition; the corruption is silent and hard to reverse. Set the limit, then backfill.
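A backfill-safe stateful DAG declares both limits up front. A sketch, assuming the TaskFlow `@dag` decorator; the schedule, dates, and DAG name are illustrative:

```python
import pendulum

from airflow.decorators import dag

@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,        # no silent run floods on unpause
    max_active_runs=1,    # backfills proceed strictly one run at a time
)
def partition_writer():
    ...

partition_writer()
```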
## Deployment errors

### `astro deploy` fails with an image build error

Root cause: Package mismatch, outdated base image, or a bad Dockerfile.
Fix:
- Build locally first: does `astro dev start` succeed?
- Check `requirements.txt` for conflicting pins.
- Check `packages.txt` for apt packages not available in the base image.
### DAGs deploy but do not appear in the UI

Root cause: Parse error (most common), or the DAGs folder not refreshing.
Fix:
- Run `astro dev run dags list-import-errors` or its production equivalent via the UI.
- Fix parse errors; wait for the next parse cycle.
- Check `[scheduler] dag_dir_list_interval`; if set too high, new DAGs take a while to appear.
## Astronomer-specific

### Deployment UI shows stale metrics
Root cause: Metrics pipeline delay, not a real problem.
Fix: Refresh; give it a minute. If persistent, check Astronomer status page.
### Cannot scale workers up
Root cause: Deployment quota cap.
Fix: Check the deployment's resource cap in the Astro UI; request an increase if needed.
## Quick diagnostic commands

```
# Everything
airflow info                                          # version info

# Task-specific
airflow tasks states-for-dag-run <dag_id> <run_id>
airflow tasks log <dag_id> <task_id> <date>

# DAG-specific
airflow dags state <dag_id> <date>
airflow dags list-runs --dag-id <dag_id> --state failed

# Parse errors
airflow dags list-import-errors

# Which DAGs are slow to parse
airflow dags report

# Metadata DB health
airflow db check

# Pool utilization
airflow pools list

# Scheduler health
curl $AIRFLOW_HOST/api/v1/health
```
## See also
- Failure triage — the 5-minute procedure.
- Concurrency reference — the knobs this page references.
- Production readiness — items whose absence causes most of these errors.