An Airflow DAG failed at 3am. You are on call. This guide is the procedure. The first five minutes get you to a classification; the next twenty get you to a resolution.
Step 1: Identify the failure (30 seconds)
Open the Airflow UI → DAGs → the failing DAG → Grid view. Click the red task.
Record:
- DAG ID and task ID.
- Try number (N of max retries+1).
- Execution logical date and run ID.
- Duration (how long did it run before failing).
- Operator type (Databricks, Python, HTTP, etc.).
From the CLI:
airflow tasks states-for-dag-run <dag_id> <run_id>
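The same information is available from the stable REST API. A minimal sketch of building the documented `/api/v1` endpoint that lists task instances for a run (the base URL is an assumption about your deployment; auth is omitted):

```python
# Build the stable REST API endpoint that lists task instances for a run.
# GET this with your HTTP client of choice and read each instance's "state".
from urllib.parse import quote

def task_states_url(base_url: str, dag_id: str, run_id: str) -> str:
    return (
        f"{base_url}/api/v1/dags/{quote(dag_id)}"
        f"/dagRuns/{quote(run_id)}/taskInstances"
    )

print(task_states_url("http://localhost:8080", "nightly_etl", "manual__2026-04-21"))
```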
Step 2: Read the log (2 minutes)
UI → Task Instance → Logs → select try number
Or:
airflow tasks log <dag_id> <task_id> <logical_date> --try-number <n>
Or directly from S3 (if remote logging is on):
aws s3 cp s3://airflow-logs/prod/dag_id=<dag_id>/run_id=<run_id>/task_id=<task_id>/attempt=<n>.log - | less
What to look for:
- Last 50 lines before failure — the actual error.
- Stack traces or exception messages.
- Timeout indicators (`AirflowTaskTimeout`, `AirflowSensorTimeout`).
- Non-zero exit codes and `Negsignal.SIGKILL`.
- Connection errors or auth failures.
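The checklist can be applied mechanically. A small sketch that scans the log tail for the signals listed above (the signal strings are illustrative; extend to taste):

```python
# Scan the last N log lines for known failure signals.
SIGNALS = {
    "AirflowTaskTimeout": "task timeout",
    "AirflowSensorTimeout": "sensor timeout",
    "Negsignal.SIGKILL": "killed, often OOM",
    "Connection refused": "connection error",
    "Access Denied": "auth failure",
}

def scan_log_tail(log_text: str, n: int = 50):
    hits = []
    for line in log_text.splitlines()[-n:]:
        for needle, label in SIGNALS.items():
            if needle in line:
                hits.append((label, line.strip()))
    return hits
```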
Note
Not every failure leaves an obvious error. Tasks killed by the kernel (OOM) or the orchestrator (pod eviction) often end abruptly with no Python traceback. If the log just stops mid-output, check the infrastructure event log before the Python log.
Step 3: Classify (60 seconds)
The classification determines the recovery path.
| Class | Signals | Response |
|---|---|---|
| Infrastructure | OOM, pod eviction, node failure, network timeout | Verify infra health, then retry |
| Upstream dependency | Sensor timeout, missing file, upstream DAG failed | Fix upstream first |
| Authentication | 401, 403, Access Denied, token expired | Fix credentials, retry |
| Application logic | Python exception, SQL error, schema mismatch | Fix code, deploy, retry |
| Resource exhaustion | Pool full, max active runs, executor slots | Wait or raise limits |
| Configuration | Missing variable, bad connection string, wrong env | Fix config, retry |
| Data quality | Row count off, constraint violation, schema drift | Investigate source |
| Transient | Network blip, rate limit, temporary 5xx | Retry (usually auto-resolves) |
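The table can be turned into a first-pass classifier. A sketch using patterns drawn from the Signals column (the patterns are illustrative, not exhaustive; first match wins):

```python
import re

# Ordered rules mirroring the Signals column above; first match wins.
RULES = [
    (re.compile(r"OOMKilled|SIGKILL|node .*fail|pod .*evict", re.I), "infrastructure"),
    (re.compile(r"SensorTimeout|missing file|upstream .*fail", re.I), "upstream dependency"),
    (re.compile(r"\b40[13]\b|Access Denied|token expired", re.I), "authentication"),
    (re.compile(r"pool .*full|max active runs|executor slots", re.I), "resource exhaustion"),
    (re.compile(r"missing variable|connection string|wrong env", re.I), "configuration"),
    (re.compile(r"row count|constraint violation|schema drift", re.I), "data quality"),
    (re.compile(r"rate limit|\b5\d\d\b|connection reset", re.I), "transient"),
]

def classify(error_line: str) -> str:
    for pattern, cls in RULES:
        if pattern.search(error_line):
            return cls
    return "application logic"  # default: exceptions, SQL errors, schema mismatch
```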
Step 4: Check blast radius (30 seconds)
# How many other tasks in this DAG are affected?
airflow tasks states-for-dag-run <dag_id> <run_id>
# What downstream DAGs depend on this one's Assets?
airflow assets list-consumers <asset_uri>
If more than a handful of downstream DAGs depend on this one, page others before continuing solo.
Step 5: Check system health (1 minute)
Is this an isolated DAG failure or a platform problem?
# Scheduler heartbeat
curl -s http://<airflow-host>/api/v1/health | python -m json.tool
# Scheduler logs
kubectl logs -l component=scheduler --tail=100 --namespace=airflow
# or astro dev logs --scheduler
# Worker health
kubectl get pods -l component=worker --namespace=airflow
# or: astro dev ps
# Metadata DB
airflow db check
Platform-level signals (scheduler heartbeat behind, workers in CrashLoopBackOff, metadata DB slow): the issue is platform-wide, not DAG-specific. Escalate to infrastructure.
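The health endpoint returns JSON. A hedged sketch that decides whether the scheduler heartbeat is stale (field names follow the `/api/v1/health` payload; verify against your Airflow version):

```python
from datetime import datetime, timezone

def scheduler_healthy(health: dict, max_lag_seconds: float = 60.0) -> bool:
    """health: parsed JSON from /api/v1/health."""
    sched = health.get("scheduler", {})
    if sched.get("status") != "healthy":
        return False
    beat = sched.get("latest_scheduler_heartbeat")
    if beat is None:
        return False
    # Heartbeat is an ISO-8601 UTC timestamp; tolerate a trailing "Z".
    ts = datetime.fromisoformat(beat.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - ts).total_seconds() < max_lag_seconds
```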
Step 6: Pick the recovery path
6a. Infrastructure
Negsignal.SIGKILL or exit code -9 / 137:
kubectl describe pod <pod-name> -n airflow
# Look for: OOMKilled
Fix: raise resources.limits.memory in the task's executor_config or worker deployment. Retry.
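For KubernetesExecutor workloads, the memory bump can be expressed per-task via `executor_config`. A sketch assuming the `kubernetes` Python client is installed and `transform` is your existing callable:

```python
from kubernetes.client import models as k8s
from airflow.operators.python import PythonOperator

heavy_transform = PythonOperator(
    task_id="transform",
    python_callable=transform,          # your existing callable
    executor_config={
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        name="base",    # Airflow's worker container is named "base"
                        resources=k8s.V1ResourceRequirements(
                            requests={"memory": "2Gi"},
                            limits={"memory": "4Gi"},  # raised from the OOM-killed value
                        ),
                    )
                ]
            )
        )
    },
)
```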
Pod eviction:
kubectl get events -n airflow --sort-by='.lastTimestamp' | head -50
Fix: investigate node pressure, move task to a priority class, retry.
6b. Task timeout
airflow.exceptions.AirflowTaskTimeout: Timeout, PID: 1234
Fix:
- If the task is genuinely slow due to data growth, optimize the query or raise `execution_timeout`.
- If the task is stuck on a downstream resource (DB lock, unresponsive API), fix the downstream.
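If the task is legitimately slower now, raising `execution_timeout` looks like this (a sketch; `aggregate` is a placeholder callable):

```python
from datetime import timedelta
from airflow.operators.python import PythonOperator

aggregate_task = PythonOperator(
    task_id="aggregate",
    python_callable=aggregate,                 # placeholder callable
    execution_timeout=timedelta(minutes=45),   # raised after confirming real data growth
)
```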
6c. Sensor timeout
airflow.exceptions.AirflowSensorTimeout
The expected upstream condition did not materialize in the timeout window.
Fix by sensor type:
- S3KeySensor: verify the S3 object was actually written; check the upstream job.
- ExternalTaskSensor: verify the upstream DAG completed; check `execution_delta`.
- HttpSensor: verify the API responded with the expected check.
Warning
Do not bump sensor timeouts as a first response. A 4-hour sensor timeout means 4 hours of silence before anyone knows something is wrong. Fix the upstream; keep the timeout honest.
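A sensor configured with an honest timeout and a worker-friendly mode might look like this (a sketch; bucket and key are placeholders):

```python
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_export = S3KeySensor(
    task_id="wait_for_export",
    bucket_name="source-bucket",                  # placeholder
    bucket_key="exports/{{ ds }}/data.parquet",   # placeholder
    mode="reschedule",       # release the worker slot between pokes
    poke_interval=300,       # poke every 5 minutes
    timeout=60 * 60,         # fail loudly after 1 hour, not 4
)
```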
6d. Upstream dependency failed
The simplest pattern: fix the upstream, then retry the downstream.
# Re-run from the failed task onward
airflow dags backfill <dag_id> --start-date <logical_date> --end-date <logical_date>
Or restart downstream from the UI: select the failed task → Clear → Include Downstream.
6e. Application logic bug
A Python exception or SQL error. The fix is in your code.
- Reproduce locally: `astro dev run dags test <dag_id>`.
- Fix the bug.
- Deploy: `astro deploy` (or your CI path).
- Retry the failed DAG run.
6f. Data quality issue
Zero rows for window 2026-04-21T03:00:00+00:00
Not a code bug; the source produced bad data.
- Query the source for the window.
- If the data genuinely is not there (upstream pipeline delayed): wait, then retry.
- If the data was corrupted: coordinate the upstream fix.
- If the schema changed: fix the DAG to handle the new schema; `--full-refresh` the affected downstream.
6g. Max retries reached
Task instance has been set to failed, max retries reached
All retries exhausted. The underlying cause persisted.
- Read logs for each try number; same error or different?
- If same every try, the issue is deterministic; classify as logic / config / permission and handle accordingly.
- If different each try, investigate transient causes (intermittent network, rate-limit storms).
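The same-vs-different check can be made mechanical. A small sketch that compares the first error line from each try's log:

```python
def diagnose_retries(first_errors: list[str]) -> str:
    """first_errors: the first error line from each try's log, in try order."""
    if len(set(first_errors)) == 1:
        return "deterministic"  # same error every try: fix the root cause
    return "transient"          # different errors: suspect flaky infra or rate limits
```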
Danger
A task that fails three retries in a row with the same error is not a retry problem; it is a root-cause problem. Do not clear-and-retry five more times. Fix the cause, then retry once to verify.
Step 7: Re-run efficiently
# Just the failed task (no upstream or downstream)
airflow tasks run <dag_id> <task_id> <logical_date>
# Re-run the failed task and its downstream
# UI: Clear task → Include Downstream
# Re-run a full date range (backfill)
airflow dags backfill <dag_id> --start-date 2026-04-20 --end-date 2026-04-21
# State-based re-run (tasks in error state from last run)
airflow tasks list <dag_id> --state failed
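Bulk clears can also go through the stable REST API's `clearTaskInstances` endpoint. A hedged sketch of the request body (field names follow the documented schema; verify against your Airflow version, and keep `dry_run` on for the first call):

```python
def clear_failed_payload(task_ids, include_downstream=True, dry_run=True):
    # POST this to /api/v1/dags/<dag_id>/clearTaskInstances.
    # dry_run=True returns what *would* be cleared without changing state.
    return {
        "dry_run": dry_run,
        "task_ids": task_ids,
        "only_failed": True,
        "include_downstream": include_downstream,
    }
```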
Step 8: Verify
Before declaring the incident over:
- [ ] Previously failed tasks now show `success` for the affected run.
- [ ] Downstream DAGs triggered by Asset updates are on schedule.
- [ ] Row counts are reasonable on the destination tables.
- [ ] Monitoring dashboards reflect current data.
- [ ] Consumers notified of any delay.
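The row-count check lends itself to a tiny helper. A sketch with an assumed ±10% tolerance (tune per table):

```python
def row_count_ok(actual: int, expected: int, tolerance: float = 0.10) -> bool:
    """True if `actual` is within ±tolerance of `expected`."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / expected <= tolerance
```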
Triage template
Fill this out every time:
Date / time: _______
DAG ID: _______
Task ID: _______
Operator: [ ] DatabricksRunNow [ ] Python [ ] Sensor [ ] Other
Try number: ___ of ___
Error class: [ ] Infra [ ] Upstream [ ] Auth [ ] Logic
[ ] Resource [ ] Config [ ] Data [ ] Transient
First-line error: _______
Blast radius: ___ downstream DAGs via Asset
Resolution: [ ] Retry [ ] Fix code [ ] Fix infra
[ ] Fix upstream [ ] Fix data [ ] Escalate
Resolution detail: _______
Recovery command: _______
Attach the completed template to the incident record.
See also
- DAG authoring guide — patterns that reduce the frequency of these triages.
- Error recovery guide — the retry / pool / sensor mechanisms.
- Common errors reference — symptom-first lookup.