An Airflow DAG failed at 3am. You are on call. This guide is the procedure. The first five minutes get you to a classification; the next twenty get you to a resolution.
Step 1: Identify the failure (30 seconds)
Open the Airflow UI → DAGs → the failing DAG → Grid view. Click the red task.
Record:
- DAG ID and task ID.
- Try number (N of max retries+1).
- Execution logical date and run ID.
- Duration (how long did it run before failing).
- Operator type (Databricks, Python, HTTP, etc.).
From the CLI:
airflow tasks states-for-dag-run <dag_id> <run_id>
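The same information is available from the stable REST API. A minimal sketch of building the documented `/api/v1` endpoint that lists task instances for a run (the base URL is an assumption about your deployment; auth is omitted):

```python
# Build the stable REST API endpoint that lists task instances for a run.
# GET this with your HTTP client of choice and read each instance's "state".
from urllib.parse import quote

def task_states_url(base_url: str, dag_id: str, run_id: str) -> str:
    return (
        f"{base_url}/api/v1/dags/{quote(dag_id)}"
        f"/dagRuns/{quote(run_id)}/taskInstances"
    )

print(task_states_url("http://localhost:8080", "nightly_etl", "manual__2026-04-21"))
```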
Step 2: Read the log (2 minutes)
UI → Task Instance → Logs → select try number
Or:
airflow tasks log <dag_id> <task_id> <logical_date> --try-number <n>
Or directly from S3 (if remote logging is on):
aws s3 cp s3://airflow-logs/prod/dag_id=<dag_id>/run_id=<run_id>/task_id=<task_id>/attempt=<n>.log - | less
What to look for:
- Last 50 lines before failure — the actual error.
- Stack traces or exception messages.
- Timeout indicators (`AirflowTaskTimeout`, `AirflowSensorTimeout`).
- Non-zero exit codes and `Negsignal.SIGKILL`.
- Connection errors or auth failures.
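The checklist can be applied mechanically. A small sketch that scans the log tail for the signals listed above (the signal strings are illustrative; extend to taste):

```python
# Scan the last N log lines for known failure signals.
SIGNALS = {
    "AirflowTaskTimeout": "task timeout",
    "AirflowSensorTimeout": "sensor timeout",
    "Negsignal.SIGKILL": "killed, often OOM",
    "Connection refused": "connection error",
    "Access Denied": "auth failure",
}

def scan_log_tail(log_text: str, n: int = 50):
    hits = []
    for line in log_text.splitlines()[-n:]:
        for needle, label in SIGNALS.items():
            if needle in line:
                hits.append((label, line.strip()))
    return hits
```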
Note
Not every failure leaves an obvious error. Tasks killed by the kernel (OOM) or the orchestrator (pod eviction) often end abruptly with no Python traceback. If the log just stops mid-output, check the infrastructure event log before the Python log.
Step 3: Classify (60 seconds)
The classification determines the recovery path.
| Class | Signals | Response |
|---|---|---|
| Infrastructure | OOM, pod eviction, node failure, network timeout | Verify infra health, then retry |
| Upstream dependency | Sensor timeout, missing file, upstream DAG failed | Fix upstream first |
| Authentication | 401, 403, Access Denied, token expired | Fix credentials, retry |
| Application logic | Python exception, SQL error, schema mismatch | Fix code, deploy, retry |
| Resource exhaustion | Pool full, max active runs, executor slots | Wait or raise limits |
| Configuration | Missing variable, bad connection string, wrong env | Fix config, retry |
| Data quality | Row count off, constraint violation, schema drift | Investigate source |
| Transient | Network blip, rate limit, temporary 5xx | Retry (usually auto-resolves) |
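The table can be turned into a first-pass classifier. A sketch using patterns drawn from the Signals column (the patterns are illustrative, not exhaustive; first match wins):

```python
import re

# Ordered rules mirroring the Signals column above; first match wins.
RULES = [
    (re.compile(r"OOMKilled|SIGKILL|node .*fail|pod .*evict", re.I), "infrastructure"),
    (re.compile(r"SensorTimeout|missing file|upstream .*fail", re.I), "upstream dependency"),
    (re.compile(r"\b40[13]\b|Access Denied|token expired", re.I), "authentication"),
    (re.compile(r"pool .*full|max active runs|executor slots", re.I), "resource exhaustion"),
    (re.compile(r"missing variable|connection string|wrong env", re.I), "configuration"),
    (re.compile(r"row count|constraint violation|schema drift", re.I), "data quality"),
    (re.compile(r"rate limit|\b5\d\d\b|connection reset", re.I), "transient"),
]

def classify(error_line: str) -> str:
    for pattern, cls in RULES:
        if pattern.search(error_line):
            return cls
    return "application logic"  # default: exceptions, SQL errors, schema mismatch
```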
Step 4: Check blast radius (30 seconds)
# How many other tasks in this DAG are affected?
airflow tasks states-for-dag-run <dag_id> <run_id>
# What downstream DAGs depend on this one's Assets?
airflow assets list-consumers <asset_uri>
If more than a handful of downstream DAGs depend on this one, page others before continuing solo.
Step 5: Check system health (1 minute)
Is this an isolated DAG failure or a platform problem?
# Scheduler heartbeat
curl -s http://<airflow-host>/api/v1/health | python -m json.tool
# Scheduler logs
kubectl logs -l component=scheduler --tail=100 --namespace=airflow
# or astro dev logs --scheduler
# Worker health
kubectl get pods -l component=worker --namespace=airflow
# or: astro dev ps
# Metadata DB
airflow db check
Platform-level signals (scheduler heartbeat behind, workers in CrashLoopBackOff, metadata DB slow): the issue is platform-wide, not DAG-specific. Escalate to infrastructure.
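The health endpoint returns JSON. A hedged sketch that decides whether the scheduler heartbeat is stale (field names follow the `/api/v1/health` payload; verify against your Airflow version):

```python
from datetime import datetime, timezone

def scheduler_healthy(health: dict, max_lag_seconds: float = 60.0) -> bool:
    """health: parsed JSON from /api/v1/health."""
    sched = health.get("scheduler", {})
    if sched.get("status") != "healthy":
        return False
    beat = sched.get("latest_scheduler_heartbeat")
    if beat is None:
        return False
    # Heartbeat is an ISO-8601 UTC timestamp; tolerate a trailing "Z".
    ts = datetime.fromisoformat(beat.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - ts).total_seconds() < max_lag_seconds
```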
Step 6: Pick the recovery path
6a. Infrastructure
Negsignal.SIGKILL or exit code -9 / 137:
kubectl describe pod <pod-name> -n airflow
# Look for: OOMKilled
Fix: raise resources.limits.memory in the task's executor_config or worker deployment. Retry.
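For KubernetesExecutor workloads, the memory bump can be expressed per-task via `executor_config`. A sketch assuming the `kubernetes` Python client is installed and `transform` is your existing callable:

```python
from kubernetes.client import models as k8s
from airflow.operators.python import PythonOperator

heavy_transform = PythonOperator(
    task_id="transform",
    python_callable=transform,          # your existing callable
    executor_config={
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        name="base",    # Airflow's worker container is named "base"
                        resources=k8s.V1ResourceRequirements(
                            requests={"memory": "2Gi"},
                            limits={"memory": "4Gi"},  # raised from the OOM-killed value
                        ),
                    )
                ]
            )
        )
    },
)
```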
Pod eviction:
kubectl get events -n airflow --sort-by='.lastTimestamp' | head -50
Fix: investigate node pressure, move task to a priority class, retry.
6b. Task timeout
airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 1234
Fix:
- If the task is genuinely slow due to data growth, optimize the query or raise `execution_timeout`.
- If the task is stuck on a downstream resource (DB lock, unresponsive API), fix the downstream.
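If the task is legitimately slower now, raising `execution_timeout` looks like this (a sketch; `aggregate` is a placeholder callable):

```python
from datetime import timedelta
from airflow.operators.python import PythonOperator

aggregate_task = PythonOperator(
    task_id="aggregate",
    python_callable=aggregate,                 # placeholder callable
    execution_timeout=timedelta(minutes=45),   # raised after confirming real data growth
)
```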
6c. Sensor timeout
airflow.exceptions.AirflowSensorTimeout
The expected upstream condition did not materialize in the timeout window.
Fix by sensor type:
- S3KeySensor: verify the S3 object was actually written; check the upstream job.
- ExternalTaskSensor: verify the upstream DAG completed; check `execution_delta`.
- HttpSensor: verify the API responded with the expected check.
Warning
Do not bump sensor timeouts as a first response. A 4-hour sensor timeout means 4 hours of silence before anyone knows something is wrong. Fix the upstream; keep the timeout honest.
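A sensor configured with an honest timeout and a worker-friendly mode might look like this (a sketch; bucket and key are placeholders):

```python
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_export = S3KeySensor(
    task_id="wait_for_export",
    bucket_name="source-bucket",                  # placeholder
    bucket_key="exports/{{ ds }}/data.parquet",   # placeholder
    mode="reschedule",       # release the worker slot between pokes
    poke_interval=300,       # poke every 5 minutes
    timeout=60 * 60,         # fail loudly after 1 hour, not 4
)
```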
6d. Upstream dependency failed
The simplest pattern: fix the upstream, then retry the downstream.
# Re-run from the failed task onward
airflow dags backfill <dag_id> --start-date <logical_date> --end-date <logical_date>
Or restart downstream from the UI: select the failed task → Clear → Include Downstream.
6e. Application logic bug
A Python exception or SQL error. The fix is in your code.
- Reproduce locally: `astro dev run dags test <dag_id>`.
- Fix the bug.
- Deploy: `astro deploy` (or your CI path).
- Retry the failed DAG run.
6f. Data quality issue
Zero rows for window 2026-04-21T03:00:00+00:00
Not a code bug; the source produced bad data.
- Query the source for the window.
- If the data genuinely is not there (upstream pipeline delayed): wait, then retry.
- If the data was corrupted: coordinate the upstream fix.
- If the schema changed: fix the DAG to handle the new schema; `--full-refresh` the affected downstream.
6g. Max retries reached
Task instance has been set to failed, max retries reached
All retries exhausted. The underlying cause persisted.
- Read logs for each try number; same error or different?
- If same every try, the issue is deterministic; classify as logic / config / permission and handle accordingly.
- If different each try, investigate transient causes (intermittent network, rate-limit storms).
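The same-vs-different check can be made mechanical. A small sketch that compares the first error line from each try's log:

```python
def diagnose_retries(first_errors: list[str]) -> str:
    """first_errors: the first error line from each try's log, in try order."""
    if len(set(first_errors)) == 1:
        return "deterministic"  # same error every try: fix the root cause
    return "transient"          # different errors: suspect flaky infra or rate limits
```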
Danger
A task that fails three retries in a row with the same error is not a retry problem; it is a root-cause problem. Do not clear-and-retry five more times. Fix the cause, then retry once to verify.
Step 7: Re-run efficiently
# Just the failed task (no upstream or downstream)
airflow tasks run <dag_id> <task_id> <logical_date>
# Re-run the failed task and its downstream
# UI: Clear task → Include Downstream
# Re-run a full date range (backfill)
airflow dags backfill <dag_id> --start-date 2026-04-20 --end-date 2026-04-21
# State-based re-run (tasks in error state from last run)
airflow tasks list <dag_id> --state failed
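Bulk clears can also go through the stable REST API's `clearTaskInstances` endpoint. A hedged sketch of the request body (field names follow the documented schema; verify against your Airflow version, and keep `dry_run` on for the first call):

```python
def clear_failed_payload(task_ids, include_downstream=True, dry_run=True):
    # POST this to /api/v1/dags/<dag_id>/clearTaskInstances.
    # dry_run=True returns what *would* be cleared without changing state.
    return {
        "dry_run": dry_run,
        "task_ids": task_ids,
        "only_failed": True,
        "include_downstream": include_downstream,
    }
```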
Step 8: Verify
Before declaring the incident over:
- [ ] Previously failed tasks now show `success` for the affected run.
- [ ] Downstream DAGs triggered by Asset updates are on schedule.
- [ ] Row counts are reasonable on the destination tables.
- [ ] Monitoring dashboards reflect current data.
- [ ] Consumers notified of any delay.
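The row-count check lends itself to a tiny helper. A sketch with an assumed ±10% tolerance (tune per table):

```python
def row_count_ok(actual: int, expected: int, tolerance: float = 0.10) -> bool:
    """True if `actual` is within ±tolerance of `expected`."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / expected <= tolerance
```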
Triage template
Fill this out every time:
Date / time: _______
DAG ID: _______
Task ID: _______
Operator: [ ] DatabricksRunNow [ ] Python [ ] Sensor [ ] Other
Try number: ___ of ___
Error class: [ ] Infra [ ] Upstream [ ] Auth [ ] Logic
[ ] Resource [ ] Config [ ] Data [ ] Transient
First-line error: _______
Blast radius: ___ downstream DAGs via Asset
Resolution: [ ] Retry [ ] Fix code [ ] Fix infra
[ ] Fix upstream [ ] Fix data [ ] Escalate
Resolution detail: _______
Recovery command: _______
Attach the completed template to the incident record.
See also
- DAG authoring guide — patterns that reduce the frequency of these triages.
- Error recovery guide — the retry / pool / sensor mechanisms.
- Common errors reference — symptom-first lookup.