An Airflow DAG failed at 3am. You are on call. This guide is the procedure. The first five minutes get you to a classification; the next twenty get you to a resolution.

Step 1: Identify the failure (30 seconds)

Open the Airflow UI → DAGs → the failing DAG → Grid view. Click the red task.

Record the DAG ID, task ID, operator, try number (n of max), and the failure timestamp; these are the same fields the triage template at the end asks for.

From the CLI:

airflow tasks states-for-dag-run <dag_id> <run_id>

Step 2: Read the log (2 minutes)

UI → Task Instance → Logs → select try number

Or:

airflow tasks log <dag_id> <task_id> <logical_date> --try-number <n>

Or directly from S3 (if remote logging is on):

aws s3 cp s3://airflow-logs/prod/dag_id=<dag_id>/run_id=<run_id>/task_id=<task_id>/attempt=<n>.log - | less

What to look for: the first exception in the traceback (not the last line of retry noise), the exit code or signal if there is no traceback, and whether the log ends abruptly mid-output.

Note

Not every failure leaves an obvious error. Tasks killed by the kernel (OOM) or the orchestrator (pod eviction) often end abruptly with no Python traceback. If the log just stops mid-output, check the infrastructure event log before the Python log.

Step 3: Classify (60 seconds)

The classification determines the recovery path.

Class               | Signals                                             | Response
Infrastructure      | OOM, pod eviction, node failure, network timeout    | Verify infra health, then retry
Upstream dependency | Sensor timeout, missing file, upstream DAG failed   | Fix upstream first
Authentication      | 401, 403, Access Denied, token expired              | Fix credentials, retry
Application logic   | Python exception, SQL error, schema mismatch        | Fix code, deploy, retry
Resource exhaustion | Pool full, max active runs, executor slots          | Wait or raise limits
Configuration       | Missing variable, bad connection string, wrong env  | Fix config, retry
Data quality        | Row count off, constraint violation, schema drift   | Investigate source
Transient           | Network blip, rate limit, temporary 5xx             | Retry (usually auto-resolves)
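
The Transient row is the only class a retry solves on its own. If the same transient failure keeps paging you, bake the retry policy into the DAG instead of clearing by hand; a minimal sketch, assuming Airflow 2.x-style imports and a hypothetical DAG id:

from datetime import timedelta

from airflow import DAG

with DAG(
    dag_id="example_transient_retries",  # hypothetical DAG id
    schedule=None,
    default_args={
        # Absorb the "Transient" class (network blips, rate limits,
        # temporary 5xx) without waking anyone up.
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "retry_exponential_backoff": True,
        "max_retry_delay": timedelta(minutes=30),
    },
) as dag:
    ...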

Step 4: Check blast radius (30 seconds)

# How many other tasks in this DAG are affected?
airflow tasks states-for-dag-run <dag_id> <run_id>

# What downstream DAGs depend on this one's Assets?
airflow assets list-consumers <asset_uri>

If more than a handful of downstream DAGs depend on this one, page others before continuing solo.
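
Asset-aware scheduling is what creates that blast radius: any DAG whose schedule lists an Asset this DAG updates will stall until it runs green again. A minimal producer/consumer sketch, assuming Airflow 3's Task SDK imports (in Airflow 2.x the equivalent class is Dataset from airflow.datasets) and a hypothetical asset URI:

from airflow.sdk import Asset, dag, task

raw_orders = Asset("s3://lake/raw/orders")  # hypothetical asset URI

@dag(schedule=None)
def orders_producer():
    @task(outlets=[raw_orders])
    def extract():
        ...  # a successful run marks the asset as updated

    extract()

@dag(schedule=[raw_orders])  # appears as a consumer in the blast-radius check
def orders_consumer():
    @task
    def transform():
        ...

    transform()

orders_producer()
orders_consumer()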

Step 5: Check system health (1 minute)

Is this an isolated DAG failure or a platform problem?

# Scheduler heartbeat
curl -s http://<airflow-host>/api/v1/health | python -m json.tool

# Scheduler logs
kubectl logs -l component=scheduler --tail=100 --namespace=airflow
# or astro dev logs --scheduler

# Worker health
kubectl get pods -l component=worker --namespace=airflow
# or astro dev ps

# Metadata DB
airflow db check

If you see platform-level signals (scheduler heartbeat lagging, workers in CrashLoopBackOff, metadata DB slow or unreachable), the issue is platform-wide, not DAG-specific. Escalate to infrastructure.

Step 6: Pick the recovery path

6a. Infrastructure

Negsignal.SIGKILL or exit code -9 / 137:

kubectl describe pod <pod-name> -n airflow
# Look for: OOMKilled

Fix: raise resources.limits.memory in the task's executor_config or worker deployment. Retry.
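
With the Kubernetes executor, the memory bump can be scoped to just the task that was OOMKilled via executor_config; a sketch with illustrative sizes (the override container must be named base):

from kubernetes.client import models as k8s

# Per-task memory override for the Kubernetes executor.
bigger_memory = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # the default task container
                    resources=k8s.V1ResourceRequirements(
                        requests={"memory": "2Gi"},
                        limits={"memory": "4Gi"},
                    ),
                )
            ]
        )
    )
}

# Pass executor_config=bigger_memory to the task that was OOMKilled, e.g.
# PythonOperator(task_id="transform", python_callable=..., executor_config=bigger_memory)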

Pod eviction:

kubectl get events -n airflow --sort-by='.lastTimestamp' | head -50

Fix: investigate node pressure, move task to a priority class, retry.

6b. Task timeout

airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 1234

Fix:

  1. Work out whether the task is legitimately slower (data volume grew, a join got bigger) or hung waiting on an external system.
  2. If it is legitimately slower, raise execution_timeout (sketch below) and note why; if it is hung, fix the blocking call. Then retry.
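
execution_timeout is set per task; a minimal sketch with an illustrative two-hour limit and a hypothetical callable:

from datetime import timedelta

from airflow.operators.python import PythonOperator  # Airflow 2.x import path

transform = PythonOperator(
    task_id="transform",
    python_callable=run_transform,         # hypothetical callable defined elsewhere
    execution_timeout=timedelta(hours=2),  # past this, the task fails with AirflowTaskTimeout
)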

6c. Sensor timeout

airflow.exceptions.AirflowSensorTimeout

The expected upstream condition did not materialize in the timeout window.

Fix by sensor type:

Warning

Do not bump sensor timeouts as a first response. A 4-hour sensor timeout means 4 hours of silence before anyone knows something is wrong. Fix the upstream; keep the timeout honest.
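
An honest timeout does not have to hold a worker slot for its whole wait: mode="reschedule" (or a deferrable sensor) releases the slot between pokes. A sketch using S3KeySensor with illustrative bucket, key, and timings; the import path assumes a recent Amazon provider:

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_orders = S3KeySensor(
    task_id="wait_for_orders_file",
    bucket_name="example-landing",            # hypothetical bucket
    bucket_key="orders/{{ ds }}/orders.csv",  # hypothetical key, templated per run
    poke_interval=300,     # check every 5 minutes
    timeout=60 * 60,       # fail loudly after 1 hour, not 4
    mode="reschedule",     # release the worker slot between pokes
)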

6d. Upstream dependency failed

The simplest pattern: fix the upstream, then retry the downstream.

# Re-run this logical date (add --rerun-failed-tasks to pick up the previously failed tasks)
airflow dags backfill <dag_id> --start-date <logical_date> --end-date <logical_date>

Or restart downstream from the UI: select the failed task → Clear → Include Downstream.

6e. Application logic bug

A Python exception or SQL error. The fix is in your code.

  1. Reproduce locally: astro dev run dags test <dag_id> (or use dag.test(); see the sketch after this list).
  2. Fix the bug.
  3. Deploy: astro deploy (or your CI path).
  4. Retry the failed DAG run.
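
If the bug only reproduces with real task context, dag.test() (available since Airflow 2.5) runs a single DAG run in-process, with no scheduler or executor, so you can attach a debugger; a minimal sketch, assuming the DAG object in the file is named my_dag:

# At the bottom of the DAG file, where my_dag is the DAG defined above:
if __name__ == "__main__":
    my_dag.test()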

6f. Data quality issue

Zero rows for window 2026-04-21T03:00:00+00:00

Not a code bug; the source produced bad data.

  1. Query the source for the window.
  2. If the data genuinely is not there (upstream pipeline delayed): wait, then retry.
  3. If the data was corrupted: coordinate the upstream fix.
  4. If the schema changed: fix the DAG to handle the new schema; --full-refresh the affected downstream.
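
A lightweight guard task makes this class of failure self-describing in the log instead of surfacing as a confusing downstream error; a sketch assuming a hypothetical get_row_count helper that queries the source:

from airflow.decorators import task  # Airflow 2.x TaskFlow import
from airflow.exceptions import AirflowFailException

@task
def check_window_not_empty(**context):
    window_start = context["data_interval_start"]
    # get_row_count is a hypothetical helper that counts source rows
    # landed in this run's window.
    if get_row_count(window_start) == 0:
        # Fail immediately and skip remaining retries; retrying will not
        # make the source data appear.
        raise AirflowFailException(f"Zero rows for window {window_start}")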

6g. Max retries reached

Task instance has been set to failed, max retries reached

All retries exhausted. The underlying cause persisted.

  1. Read the log for each try number: is the error the same every time or different?
  2. If it is the same on every try, the cause is deterministic; classify it as logic, config, or permission and handle accordingly.
  3. If it differs between tries, investigate transient causes (intermittent network, rate-limit storms).

Danger

A task that fails three retries in a row with the same error is not a retry problem; it is a root-cause problem. Do not clear-and-retry five more times. Fix the cause, then retry once to verify.
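
If the same DAG keeps exhausting retries, wire the final failure into your paging path so the root cause gets an owner instead of another clear-and-retry; a sketch assuming a hypothetical notify_oncall helper:

def page_on_final_failure(context):
    ti = context["task_instance"]
    # notify_oncall is a hypothetical helper (Slack, PagerDuty, etc.).
    notify_oncall(
        f"{ti.dag_id}.{ti.task_id} failed on try {ti.try_number} "
        f"for run {context['run_id']}: {context.get('exception')}"
    )

# Attach to the flaky task, or to the whole DAG via default_args:
# PythonOperator(task_id="extract", python_callable=...,
#                on_failure_callback=page_on_final_failure)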

Step 7: Re-run efficiently

# Just the failed task (no upstream or downstream)
airflow tasks run <dag_id> <task_id> <logical_date>

# Re-run the failed task and its downstream
# UI: Clear task → Include Downstream

# Re-run a full date range (backfill)
airflow dags backfill <dag_id> --start-date 2026-04-20 --end-date 2026-04-21

# State-based re-run: clear only the failed task instances so the scheduler re-runs them
airflow tasks clear <dag_id> --only-failed --start-date <logical_date> --end-date <logical_date>

Step 8: Verify

Before declaring the incident over, confirm that the re-run finished successfully, that the downstream DAGs fed by this one's Assets are green, that the output data exists for the affected window, and that the alert that paged you has cleared.

Triage template

Fill this out every time:

Date / time:         _______
DAG ID:              _______
Task ID:             _______
Operator:            [ ] DatabricksRunNow [ ] Python [ ] Sensor [ ] Other
Try number:          ___ of ___
Error class:         [ ] Infra  [ ] Upstream  [ ] Auth  [ ] Logic
                     [ ] Resource  [ ] Config  [ ] Data  [ ] Transient
First-line error:    _______
Blast radius:        ___ downstream DAGs via Asset
Resolution:          [ ] Retry  [ ] Fix code  [ ] Fix infra
                     [ ] Fix upstream  [ ] Fix data  [ ] Escalate
Resolution detail:   _______
Recovery command:    _______

Attach the completed template to the incident record.
