An Airflow deployment is production-ready when the on-call engineer can step away from it and nothing silently corrupts data in their absence. This checklist is the threshold.
1. Airflow version
- [ ] Airflow 3.x on a supported provider bundle. No 2.x in production.
- [ ] DAG versioning enabled (default in 3.x).
- [ ] Upgrade path for future releases documented; providers pinned.
See Airflow 3 changes for the migration reality if you are still on 2.x.
2. Platform
- [ ] Astronomer (or an equivalent managed platform) for anything critical. No "one EC2 instance running Airflow" in production.
- [ ] Three environments: dev, staging, prod. One Airflow workspace per environment.
- [ ] No dev DAGs in prod workspaces.
- [ ] Triggerer process deployed (required for deferrable operators and AssetWatchers).
- [ ] Metadata DB (Postgres) is managed (RDS, Azure DB, Astronomer-managed) with backups and failover.
Danger
Production Airflow on a single VM with no backups is a data-loss incident waiting to happen. Airflow's metadata DB holds every DAG run's history, every variable, every connection. If the instance dies and the DB dies with it, you cannot triage the last 90 days of incidents. Managed platform, managed DB, backups tested.
3. Deployment pipeline
- [ ] Git is the source of truth for DAG code.
- [ ] CI validates every PR: parse check (`airflow dags list`), unit tests on callables, ruff + mypy.
- [ ] CI deploys to dev on every PR; smoke tests run.
- [ ] Merge to main deploys to staging; integration tests gate prod.
- [ ] Prod deploys triggered by CI (tagged release), not manual `astro deploy`.
- [ ] OIDC authentication from CI to Astronomer; no long-lived PATs.
- [ ] Every prod deploy is tagged; rollback path documented.
4. DAG hygiene
See DAG authoring standards for per-DAG rules. Readiness requires:
- [ ] All production DAGs idempotent (writes are `MERGE`/`UPSERT`; `logical_date` is the partition key).
- [ ] `catchup=False` on every DAG.
- [ ] `max_active_runs=1` on stateful DAGs.
- [ ] pendulum-timezoned `start_date`; UTC scheduling for machine work.
- [ ] No `datetime.now()` in task logic.
- [ ] No top-level heavy imports in DAG files.
- [ ] Tags per team, domain, and criticality.
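The hygiene items above map directly onto DAG keyword arguments. A minimal sketch — the DAG name, schedule, and tags are hypothetical, and production code would use `pendulum.datetime(..., tz="UTC")` for the timezone-aware `start_date` (stdlib `timezone.utc` is shown so the snippet runs anywhere):

```python
from datetime import datetime, timezone

# Hypothetical kwargs a checklist-compliant DAG would pass to DAG(...).
# Production code would use pendulum.datetime(2024, 1, 1, tz="UTC") here.
dag_kwargs = {
    "dag_id": "orders_silver",                                # hypothetical name
    "start_date": datetime(2024, 1, 1, tzinfo=timezone.utc),  # timezone-aware
    "schedule": "0 3 * * *",                                  # machine work scheduled in UTC
    "catchup": False,                                         # no surprise historical runs
    "max_active_runs": 1,                                     # stateful: one run at a time
    "tags": ["team:data-eng", "domain:orders", "criticality:high"],
}
```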
5. Reliability
- [ ] Retries with exponential backoff and a `max_retry_delay` cap on every task.
- [ ] `on_failure_callback` wired to PagerDuty, OpsGenie, or Slack (not email).
- [ ] Deferrable sensors only; synchronous sensors banned.
- [ ] Pools for every rate-limited external resource; `max_active_runs_per_dag=1` for stateful pipelines.
- [ ] `AirflowFailException` used for deterministic / poison-message failures.
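The retry items above translate into task `default_args`. A sketch — the parameter names are Airflow's task parameters, and the callback is a hypothetical placeholder:

```python
from datetime import timedelta

# Retry policy per the checklist: exponential backoff with a hard cap,
# so a flaky upstream cannot stretch retries unboundedly.
default_args = {
    "retries": 5,
    "retry_delay": timedelta(minutes=1),       # base delay
    "retry_exponential_backoff": True,         # roughly 1m, 2m, 4m, ...
    "max_retry_delay": timedelta(minutes=30),  # cap on the backoff
    # "on_failure_callback": page_oncall,      # hypothetical PagerDuty/Slack hook
}
```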
6. Dependencies
- [ ] Assets (not `ExternalTaskSensor`) for DAG-to-DAG coordination.
- [ ] AssetWatchers (not polling sensors) for external triggers.
- [ ] Cosmos with `LoadMode.DBT_MANIFEST` when dbt lives in Airflow.
- [ ] Databricks integration via `DatabricksRunNowOperator` against DAB-deployed job IDs.
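Asset-based coordination, sketched with a guarded import so the snippet runs without Airflow installed (the asset URI is hypothetical):

```python
try:
    from airflow.sdk import Asset          # Airflow 3 authoring interface
except ImportError:
    class Asset:                           # stand-in so the sketch runs anywhere
        def __init__(self, uri: str):
            self.uri = uri

orders_silver = Asset("s3://lake/silver/orders")  # hypothetical asset URI

# Producer task declares it:    @task(outlets=[orders_silver])
# Consumer DAG schedules on it: DAG(..., schedule=[orders_silver])
# No ExternalTaskSensor, no polling: the consumer runs when the asset updates.
```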
7. Concurrency
- [ ] `parallelism` sized for actual worker capacity.
- [ ] Pools defined for every shared external resource; slots match real rate limits.
- [ ] `max_active_tasks_per_dag` set on DAGs that could burst and monopolize the cluster.
- [ ] No synchronous sensors holding worker slots.
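The pool-sizing item, as a sketch — pool names and limits are hypothetical; the point is that slots mirror the real concurrency limit of the shared resource:

```python
# Hypothetical pool definitions (created via the UI, CLI, or API):
# slots equal the resource's actual limit, not a guess.
POOLS = {
    "snowflake_wh": 8,  # warehouse allows 8 concurrent queries
    "vendor_api": 2,    # vendor rate limit: 2 in-flight calls
}

# Every task that touches the resource opts in:
#   @task(pool="vendor_api", pool_slots=1)
```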
8. Secrets
- [ ] Secrets in a backend (AWS Secrets Manager, Azure Key Vault, HashiCorp Vault), not the metadata DB.
- [ ] Connections reference the backend; no literal tokens in DAG code or deployment variables.
- [ ] Astronomer deployment variables marked as secret where applicable; values redacted in UI.
- [ ] Rotation cadence documented (90 days typical for named secrets; OIDC is preferred for CI and removes rotation entirely).
9. Observability
- [ ] Structured JSON logs with `dag_id`, `task_id`, `run_id`, `logical_date`, and business context.
- [ ] Logs shipped to a central platform (Datadog, Grafana Loki, Splunk, Astro Observe).
- [ ] OpenLineage events emitted from DAGs that produce Assets.
- [ ] Metrics exported to Prometheus or equivalent: scheduler heartbeat, task duration, queue depth, pool utilization.
- [ ] Astro Observe (if on Astronomer) enabled for pipeline-level lineage, data quality, and cost attribution.
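The structured-log item can be sketched as a small helper: one JSON object per line carrying the run-context fields, so the central platform can index them. The helper name and the business fields are hypothetical:

```python
import json
import logging

logger = logging.getLogger("airflow.task")

def log_event(event: str, context: dict, **business) -> str:
    """Emit one indexable JSON log line: run context plus business fields."""
    record = {
        "event": event,
        "dag_id": context.get("dag_id"),
        "task_id": context.get("task_id"),
        "run_id": context.get("run_id"),
        "logical_date": str(context.get("logical_date")),
        **business,  # e.g. rows_written, source_table
    }
    line = json.dumps(record)
    logger.info(line)
    return line
```

Inside a task, `context` would be the Airflow task context; here it is just a dict, which keeps the sketch testable anywhere.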
10. Alerting
- [ ] PagerDuty (or equivalent) for pageable incidents: task failure on a critical DAG, scheduler heartbeat stale, metadata DB unreachable.
- [ ] OpsGenie (or equivalent) for on-call routing.
- [ ] Slack for team-visible signals: DAG failure on a non-critical DAG, retry callbacks.
- [ ] Email alerts not used for paging; summary dashboards only.
- [ ] Alert fatigue review quarterly: which alerts fired, which were actionable, prune the rest.
Warning
Email alerts are the gateway drug to alert fatigue. Every team that starts with "Airflow email on failure" ends up with the channel muted inside six months. Use PagerDuty for the things that should wake someone up, Slack for the things people should see the next morning, and nothing for the things people do not care about.
11. Data quality
- [ ] Row-count assertions at the end of every meaningful write task.
- [ ] `expect_or_fail` or equivalent on silver pipelines (via LDP, dbt tests, or Astro Observe data quality).
- [ ] `AirflowFailException` raised on unmet invariants so the pipeline stops instead of publishing bad data.
- [ ] Downstream consumers alerted on data-quality events via Assets that signal "bad data" vs "new data".
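The row-count item, as a sketch — the guarded import lets it run without Airflow installed; the table name and threshold are hypothetical:

```python
try:
    from airflow.exceptions import AirflowFailException
except ImportError:
    class AirflowFailException(Exception):  # stand-in outside Airflow
        pass

def assert_row_count(actual: int, expected_min: int, table: str) -> int:
    # Fail the task outright (no retries) rather than publish a short table:
    # a deterministic data-quality violation will not fix itself on retry.
    if actual < expected_min:
        raise AirflowFailException(
            f"{table}: wrote {actual} rows, expected at least {expected_min}"
        )
    return actual
```

Calling this at the end of every meaningful write task is what turns "the load ran" into "the load produced plausible data".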
12. Runbooks
- [ ] Runbook for DAG failure triage (see failure triage guide).
- [ ] Runbook for re-running a single failed task vs full DAG backfill.
- [ ] Runbook for recovering from scheduler stall (heartbeat stale, DB unreachable).
- [ ] Runbook for emergency DAG pause / unpause during an incident.
- [ ] On-call knows where runbooks live without asking.
13. Testing
- [ ] Unit tests for task callables; `astro dev pytest` or `pytest src/ tests/` passes on every PR.
- [ ] DAG render tests: every DAG parses without errors in CI.
- [ ] Integration tests on staging before prod deploy, covering the golden path.
- [ ] Local Airflow via `astro dev start` matches prod exactly (same image, same providers).
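The parse-check item as a CI helper, sketched with a guarded import (without Airflow it degrades to a no-op so the snippet itself stays runnable; real CI always has Airflow installed):

```python
def dag_import_errors(dag_folder: str) -> dict:
    """Return {file: error} for every DAG file that fails to parse."""
    try:
        from airflow.models import DagBag
    except ImportError:
        return {}  # sketch only: real CI environments include Airflow
    bag = DagBag(dag_folder=dag_folder, include_examples=False)
    return dict(bag.import_errors)

# In CI (pytest):
#   def test_dags_parse():
#       assert dag_import_errors("dags/") == {}
```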
14. Documentation
- [ ] Every DAG has a description explaining its purpose, upstream sources, and downstream consumers.
- [ ] Every task in a critical DAG has a docstring or comment explaining what it does.
- [ ] Incident runbooks linked from the DAG description where applicable.
- [ ] Workspace README lists the set of DAGs, ownership, and on-call escalation path.
15. Security
- [ ] RBAC enabled (`rbac = True` in OSS; default on Astronomer).
- [ ] `expose_config = False` on the webserver, so airflow.cfg contents (which can embed secrets) stay out of the UI.
- [ ] Webserver behind authentication (Entra, Okta, etc.); no open Airflow UI on the internet.
- [ ] Worker pods run as non-root.
- [ ] Container images scanned for CVEs as part of the deploy pipeline.
16. Scale readiness
When the deployment crosses these thresholds, extra attention is warranted:
| DAG count | Action |
|---|---|
| 50 | Review parse times (`airflow dags report`); fix slow parsers. |
| 200 | Split into multiple workspaces by domain if parse contention is visible. |
| 500 | Astro Observe or OpenLineage for cross-DAG observability; manual correlation is impractical. |
| 1000 | Dedicated on-call, documented runbooks per domain, alerting per DAG. |
17. The promote-to-prod gate
Before a DAG (or a new workspace) is considered production-ready, a reviewer confirms each section above with a concrete artifact: a PR review note, a CI run link, a screenshot of an alert routing correctly, a runbook link. The gate is the checklist.
Deviations require an RFD and a dated waiver in the workspace README. Waivers expire on a fixed cadence (90 days typical).
Important
The supervisor model is the through-line of every section above. Airflow is most valuable when it is boring: it retries what it should, alerts on what it should, and lets the data engines do the data work. The teams that understand this spend their time building business value on top. The teams that do not spend their time fighting the scheduler.
See also
- DAG authoring standards — the per-DAG rules these readiness items rest on.
- The supervisor model — the philosophy that drives every decision here.
- Error recovery — the mechanisms section 5 requires.
- Failure triage — the procedure section 12 requires.