Airflow handles success easily; the interesting work is what happens when something fails. This guide is the pragmatic playbook for the four mechanisms you have: retries, pools, sensors, and callbacks.

Start here: the distinction

The most common Airflow reliability bug is muddling these four mechanisms.

| Mechanism | For |
| --- | --- |
| Retries | Transient failures (network blip, 503, timeout) |
| Pools | Gating concurrency against a rate-limited external resource |
| Sensors | Waiting for something to become ready |
| Callbacks | Emitting signals (alerts, audits) on state transitions |

Using retries to wait for something to be ready produces flakiness. Using a sensor for a transient failure starves workers. Using a pool for transient errors does nothing useful. Get the frame right.

Retries

Every task should have retries configured. The default (no retries) is the wrong default.

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(hours=1),
}

Exponential backoff

Without exponential backoff, retries fire at a fixed interval and tend to fail the same way each time. With it, retries space out (5 min → 10 min → 20 min), giving the downstream system room to recover.

max_retry_delay caps the backoff. Without a cap, the doubling continues unchecked, and a task with a generous retry budget can end up waiting hours between its final attempts, which is usually not what you want.
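A quick sketch of how these settings interact. This is a simplified model (Airflow's real backoff also adds jitter), so treat the numbers as illustrative:

```python
from datetime import timedelta

def backoff_schedule(retries: int, retry_delay: timedelta,
                     max_retry_delay: timedelta) -> list[timedelta]:
    """Illustrative doubling backoff: the delay before attempt n, capped."""
    return [min(retry_delay * (2 ** n), max_retry_delay) for n in range(retries)]

# With the default_args above: 5 min, 10 min, 20 min -- the cap never binds.
print(backoff_schedule(3, timedelta(minutes=5), timedelta(hours=1)))

# With a larger retry budget the cap matters: later waits clamp at 1 hour.
print(backoff_schedule(7, timedelta(minutes=5), timedelta(hours=1)))
```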

When to skip retries: AirflowFailException

from airflow.decorators import task
from airflow.exceptions import AirflowFailException

@task
def validate(payload: dict):
    if payload["schema_version"] not in {"v1", "v2"}:
        raise AirflowFailException(
            f"Unsupported schema {payload['schema_version']}. "
            "Cannot proceed; not retryable."
        )

AirflowFailException fails the task immediately and skips remaining retries. Use it for deterministic failures that a retry cannot fix:

  - Bad or malformed input
  - Schema mismatches
  - Missing permissions or credentials

Danger

Do not retry a task twelve times hoping it will eventually work. If the error is deterministic, retries cost time and mask the real signal. AirflowFailException surfaces the problem immediately; the on-call engineer sees the real error instead of "retry 12/12 failed".

Pools

Pools gate concurrency across DAGs against a shared external resource. They are the right answer for:

  - Rate-limited third-party APIs (the Salesforce example below)
  - Any shared external resource that can absorb at most N concurrent callers

from airflow.models import Pool

# Define once (via UI, CLI, or a seeding DAG)
Pool.create_or_update_pool(
    name="salesforce_api",
    slots=5,
    description="Max concurrent Salesforce API calls across all DAGs",
)

# Use per-task
extract = PythonOperator(
    task_id="extract_from_sf",
    python_callable=fetch,
    pool="salesforce_api",
    pool_slots=1,   # default; heavy tasks can claim multiple slots
)

Every task that touches the rate-limited resource references the pool. Airflow queues tasks when no slots are free; it does not reject them.
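Conceptually, a pool is a counting semaphore shared across the whole deployment: callers beyond the slot limit queue rather than fail. A plain-Python analogy (illustrative only, not Airflow internals):

```python
import threading
import time

SLOTS = 2                            # analogue of a pool with slots=2
pool = threading.Semaphore(SLOTS)
lock = threading.Lock()
active = peak = 0
completed = []

def call_api(name: str):
    global active, peak
    with pool:                       # no free slot -> the caller queues here
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)             # stand-in for the rate-limited API call
        with lock:
            active -= 1
        completed.append(name)

threads = [threading.Thread(target=call_api, args=(f"task_{i}",)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{len(completed)} tasks ran; peak concurrency was {peak}")
```

All five callers complete; none is rejected, and the measured peak never exceeds the slot count.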

Pools vs retries

| Situation | Use |
| --- | --- |
| API returned 429 rate limit on this specific call | Retry with backoff |
| API can handle at most N concurrent calls at any time | Pool with N slots |
| Both | Pool + retry; the pool limits concurrency, retries handle the one-off 429s |
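Putting the "both" row into practice, the extract task from the pool example above would carry the retry settings as well. A sketch of the combined keyword arguments (same hypothetical task names as earlier, not a full DAG):

```python
from datetime import timedelta

# Combined settings for a task hitting a rate-limited API:
# the pool bounds concurrency; retries with backoff absorb one-off 429s.
extract_kwargs = {
    "task_id": "extract_from_sf",
    "pool": "salesforce_api",            # at most 5 concurrent across all DAGs
    "pool_slots": 1,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(hours=1),
}
print(sorted(extract_kwargs))
```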

Sensors

Sensors wait for something to become ready: a file to appear, an external task to finish, a database row to exist. They are the right answer for "not ready yet".

The async rule

The most important rule in 2026:

Warning

Never use a synchronous sensor. A sensor in the classic poke mode holds a worker slot the entire time it waits; at scale, sensors starve real work. Use deferrable (async) variants: *Async operators or deferrable=True. They free the worker while waiting; the Triggerer process handles the async event loop.

# Wrong: synchronous; holds a worker slot until resolved
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait = S3KeySensor(
    task_id="wait_for_file",
    bucket_key="orders/2026-04-21.parquet",
    bucket_name="landing-zone",
    poke_interval=30,
    timeout=3600,
)

# Right: deferrable; releases worker slot via Triggerer
wait = S3KeySensor(
    task_id="wait_for_file",
    bucket_key="orders/2026-04-21.parquet",
    bucket_name="landing-zone",
    deferrable=True,   # <— the difference
    poke_interval=30,
    timeout=3600,
)

Timeouts always

A sensor with no timeout waits forever. A sensor with a 10-year timeout is effectively the same thing. A reasonable default is 2 × the expected upstream latency.

If the timeout fires, the failure is visible; someone investigates; the real problem gets fixed. Without a timeout, the sensor silently hogs a slot indefinitely.
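The 2× rule is easy to encode next to the sensor definition. A small helper (hypothetical, not an Airflow API):

```python
from datetime import timedelta

def sensor_timeout(expected_upstream: timedelta, factor: float = 2.0) -> int:
    """Sensor timeout in seconds: factor x the expected upstream latency."""
    return int(expected_upstream.total_seconds() * factor)

# Upstream normally lands within 30 minutes -> give up after an hour.
print(sensor_timeout(timedelta(minutes=30)))  # 3600
```

The returned value plugs straight into a sensor's timeout parameter, which takes seconds.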

Sensors vs AssetWatchers

When the trigger is truly external (file arriving, event on a queue, table update), prefer an AssetWatcher over a sensor. A watcher subscribes to events; a sensor polls. Watchers are more reliable, cheaper, and lower latency.
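The poll-vs-subscribe difference is the whole argument. A language-level sketch (plain Python, nothing Airflow-specific) makes the latency gap concrete: the poller notices an event only at its next check, while the subscriber wakes as soon as the event is pushed:

```python
import queue
import threading
import time

events: "queue.Queue[str]" = queue.Queue()

def producer():
    time.sleep(0.2)                 # the external event arrives at ~t=0.2s
    events.put("file_landed")

# Sensor-style polling: checks every 0.15s, so worst-case latency is a
# full poke_interval after the event actually happened.
threading.Thread(target=producer).start()
start = time.monotonic()
while events.empty():
    time.sleep(0.15)
poll_latency = time.monotonic() - start
events.get()

# Watcher-style subscription: blocks on the event itself.
threading.Thread(target=producer).start()
start = time.monotonic()
events.get()                        # returns as soon as the event is pushed
push_latency = time.monotonic() - start

print(f"poll noticed after ~{poll_latency:.2f}s, push after ~{push_latency:.2f}s")
```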

Keep sensors for:

  - Cross-DAG dependencies inside Airflow (an upstream task in another DAG)
  - Conditions that can only be polled (a database row, an internal flag) because no event stream exists to subscribe to

Callbacks

Callbacks fire on state transitions. Airflow supports on_failure_callback, on_retry_callback, on_success_callback, and on_skipped_callback.

Failure callbacks: replace email

SLAs were removed in Airflow 3; the replacement is on_failure_callback. Wire it to real alerting (PagerDuty, OpsGenie, Slack), not email.

def alert_on_failure(context):
    from airflow.models import TaskInstance
    ti: TaskInstance = context["ti"]
    send_pagerduty_alert(
        title=f"{ti.dag_id}/{ti.task_id} failed",
        body=(
            f"Run: {ti.run_id}\n"
            f"Logical date: {ti.logical_date}\n"
            f"Log URL: {ti.log_url}"
        ),
        severity="warning",
    )

default_args = {
    "on_failure_callback": alert_on_failure,
    ...
}

Important

Email is the beginning of alert fatigue. Route callbacks to PagerDuty (for pageable incidents), OpsGenie (for on-call routing), and Slack (for team visibility). Email alerts should go to a summary dashboard, not to individual engineers.

Retry callbacks

on_retry_callback fires when a task is about to retry. Useful for signaling "degraded but recovering":

def log_retry(context):
    ti = context["ti"]
    metric("airflow.task.retry", tags={"dag": ti.dag_id, "task": ti.task_id})

Avoid on_success_callback for noisy work

Every successful task fires this callback. It runs inside the worker after the task body. A Slack post on every success produces alert fatigue within a week; reserve on_success_callback for observability (metric emission, lineage events), not notifications.

Concurrency gotchas

Five concurrency knobs interact. Understand all of them before tuning one.

  1. parallelism (cluster-wide) — upper bound for the whole Airflow cluster.
  2. max_active_tasks_per_dag — how many tasks of one DAG can run simultaneously.
  3. max_active_runs_per_dag — how many runs of one DAG can be live in parallel. Set to 1 for stateful DAGs.
  4. max_active_tis_per_dagrun — concurrency of a single task within one run (matters for dynamically mapped tasks).
  5. Pools — cross-DAG gates for external resources.

Tuning in the wrong place is the classic trap. If the Salesforce API is rate-limited, that is a pool concern, not a max_active_tasks_per_dag concern. If a DAG with overlapping runs corrupts its writes, that calls for max_active_runs=1, not a retry configuration.

See concurrency reference for the full matrix.

A decision tree

Walk these when a task fails:

  1. Is the error deterministic (bad input, schema mismatch, missing permission)? → AirflowFailException; fix the root cause.
  2. Is the error transient (network blip, 503, timeout)? → retry with exponential backoff.
  3. Is the external resource rate-limited? → pool + retries for the one-offs.
  4. Is the task waiting for something outside Airflow's control? → deferrable sensor with a timeout, or an AssetWatcher.
  5. Did the task succeed but produce bad data? → row-count assertion + AirflowFailException; callbacks emit the signal.
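The tree above mechanizes cleanly. A sketch an on-call runbook could embed (the flag and function names are mine, not Airflow's):

```python
def failure_playbook(deterministic: bool = False, transient: bool = False,
                     rate_limited: bool = False, waiting_on_external: bool = False,
                     bad_data: bool = False) -> str:
    """Map the decision tree in the text onto a recommended mechanism."""
    if deterministic:
        return "AirflowFailException; fix the root cause"
    if transient:
        return "retry with exponential backoff"
    if rate_limited:
        return "pool + retries for the one-offs"
    if waiting_on_external:
        return "deferrable sensor with a timeout, or an AssetWatcher"
    if bad_data:
        return "row-count assertion + AirflowFailException; callback emits the signal"
    return "unclassified; investigate manually"

print(failure_playbook(transient=True))
```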

Common mistakes

| Symptom | Root cause |
| --- | --- |
| Same error every retry, all fail | Retries used for a deterministic failure; should be AirflowFailException. |
| Worker pool exhausted during a peak | Sync sensors holding slots; switch to deferrable. |
| Salesforce rate-limits the whole team | No pool; every DAG hits the API independently. |
| Alerts drowning the channel | on_success_callback posting to Slack on every task. |
| Retries fire close together, all 429 | No retry_exponential_backoff; add it. |
| Sensor waits forever | No timeout; set one at ~2× expected upstream latency. |

See also