An Airflow deployment is production-ready when the on-call engineer can step away from it and nothing silently corrupts data in their absence. This checklist is the threshold.
1. Airflow version
- [ ] Airflow 3.x on a supported provider bundle. No 2.x in production.
- [ ] DAG versioning enabled (default in 3.x).
- [ ] Upgrade path for future releases documented; providers pinned.
See Airflow 3 changes for the migration reality if you are still on 2.x.
2. Platform
- [ ] Astronomer (or an equivalent managed platform) for anything critical. No "one EC2 instance running Airflow" in production.
- [ ] Three environments: dev, staging, prod. One Airflow workspace per environment.
- [ ] No dev DAGs in prod workspaces.
- [ ] Triggerer process deployed (required for deferrable operators and AssetWatchers).
- [ ] Metadata DB (Postgres) is managed (RDS, Azure DB, Astronomer-managed) with backups and failover.
Danger
Production Airflow on a single VM with no backups is a data-loss incident waiting to happen. Airflow's metadata DB holds every DAG run's history, every variable, every connection. If the instance dies and the DB dies with it, you cannot triage the last 90 days of incidents. Managed platform, managed DB, backups tested.
3. Deployment pipeline
- [ ] Git is the source of truth for DAG code.
- [ ] CI validates every PR: parse check (`airflow dags list`), unit tests on callables, ruff + mypy.
- [ ] CI deploys to dev on every PR; smoke tests run.
- [ ] Merge to main deploys to staging; integration tests gate prod.
- [ ] Prod deploys triggered by CI (tagged release), not manual `astro deploy`.
- [ ] OIDC authentication from CI to Astronomer; no long-lived PATs.
- [ ] Every prod deploy is tagged; rollback path documented.
4. DAG hygiene
See DAG authoring standards for per-DAG rules. Readiness requires:
- [ ] All production DAGs idempotent (writes are `MERGE`/`UPSERT`; `logical_date` is the partition key).
- [ ] `catchup=False` on every DAG.
- [ ] `max_active_runs=1` on stateful DAGs.
- [ ] pendulum-timezoned `start_date`; UTC scheduling for machine work.
- [ ] No `datetime.now()` in task logic.
- [ ] No top-level heavy imports in DAG files.
- [ ] Tags per team, domain, and criticality.
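The hygiene items above map directly onto DAG keyword arguments. A minimal sketch — the DAG name, schedule, and tags are hypothetical, and production code would use `pendulum.datetime(..., tz="UTC")` for the timezone-aware `start_date` (stdlib `timezone.utc` is shown so the snippet runs anywhere):

```python
from datetime import datetime, timezone

# Hypothetical kwargs a checklist-compliant DAG would pass to DAG(...).
# Production code would use pendulum.datetime(2024, 1, 1, tz="UTC") here.
dag_kwargs = {
    "dag_id": "orders_silver",                                # hypothetical name
    "start_date": datetime(2024, 1, 1, tzinfo=timezone.utc),  # timezone-aware
    "schedule": "0 3 * * *",                                  # machine work scheduled in UTC
    "catchup": False,                                         # no surprise historical runs
    "max_active_runs": 1,                                     # stateful: one run at a time
    "tags": ["team:data-eng", "domain:orders", "criticality:high"],
}
```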
5. Reliability
- [ ] Retries with exponential backoff and a `max_retry_delay` cap on every task.
- [ ] `on_failure_callback` wired to PagerDuty, OpsGenie, or Slack (not email).
- [ ] Deferrable sensors only; synchronous sensors banned.
- [ ] Pools for every rate-limited external resource; `max_active_runs_per_dag=1` for stateful pipelines.
- [ ] `AirflowFailException` used for deterministic / poison-message failures.
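The retry items above translate into task `default_args`. A sketch — the parameter names are Airflow's task parameters, and the callback is a hypothetical placeholder:

```python
from datetime import timedelta

# Retry policy per the checklist: exponential backoff with a hard cap,
# so a flaky upstream cannot stretch retries unboundedly.
default_args = {
    "retries": 5,
    "retry_delay": timedelta(minutes=1),       # base delay
    "retry_exponential_backoff": True,         # roughly 1m, 2m, 4m, ...
    "max_retry_delay": timedelta(minutes=30),  # cap on the backoff
    # "on_failure_callback": page_oncall,      # hypothetical PagerDuty/Slack hook
}
```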
6. Dependencies
- [ ] Assets (not `ExternalTaskSensor`) for DAG-to-DAG coordination.
- [ ] AssetWatchers (not polling sensors) for external triggers.
- [ ] Cosmos with `LoadMode.DBT_MANIFEST` when dbt lives in Airflow.
- [ ] Databricks integration via `DatabricksRunNowOperator` against DAB-deployed job IDs.
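Asset-based coordination, sketched with a guarded import so the snippet runs without Airflow installed (the asset URI is hypothetical):

```python
try:
    from airflow.sdk import Asset          # Airflow 3 authoring interface
except ImportError:
    class Asset:                           # stand-in so the sketch runs anywhere
        def __init__(self, uri: str):
            self.uri = uri

orders_silver = Asset("s3://lake/silver/orders")  # hypothetical asset URI

# Producer task declares it:    @task(outlets=[orders_silver])
# Consumer DAG schedules on it: DAG(..., schedule=[orders_silver])
# No ExternalTaskSensor, no polling: the consumer runs when the asset updates.
```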
7. Concurrency
- [ ] `parallelism` sized for actual worker capacity.
- [ ] Pools defined for every shared external resource; slots match real rate limits.
- [ ] `max_active_tasks_per_dag` set on DAGs that could burst and monopolize the cluster.
- [ ] No synchronous sensors holding worker slots.
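The pool-sizing item, as a sketch — pool names and limits are hypothetical; the point is that slots mirror the real concurrency limit of the shared resource:

```python
# Hypothetical pool definitions (created via the UI, CLI, or API):
# slots equal the resource's actual limit, not a guess.
POOLS = {
    "snowflake_wh": 8,  # warehouse allows 8 concurrent queries
    "vendor_api": 2,    # vendor rate limit: 2 in-flight calls
}

# Every task that touches the resource opts in:
#   @task(pool="vendor_api", pool_slots=1)
```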
8. Secrets
- [ ] Secrets in a backend (AWS Secrets Manager, Azure Key Vault, HashiCorp Vault), not the metadata DB.
- [ ] Connections reference the backend; no literal tokens in DAG code or deployment variables.
- [ ] Astronomer deployment variables marked as secret where applicable; values redacted in UI.
- [ ] Rotation cadence documented (90 days typical for named secrets; OIDC is preferred for CI and removes rotation entirely).
9. Observability
- [ ] Structured JSON logs with `dag_id`, `task_id`, `run_id`, `logical_date`, and business context.
- [ ] Logs shipped to a central platform (Datadog, Grafana Loki, Splunk, Astro Observe).
- [ ] OpenLineage events emitted from DAGs that produce Assets.
- [ ] Metrics exported to Prometheus or equivalent: scheduler heartbeat, task duration, queue depth, pool utilization.
- [ ] Astro Observe (if on Astronomer) enabled for pipeline-level lineage, data quality, and cost attribution.
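The structured-log item can be sketched as a small helper: one JSON object per line carrying the run-context fields, so the central platform can index them. The helper name and the business fields are hypothetical:

```python
import json
import logging

logger = logging.getLogger("airflow.task")

def log_event(event: str, context: dict, **business) -> str:
    """Emit one indexable JSON log line: run context plus business fields."""
    record = {
        "event": event,
        "dag_id": context.get("dag_id"),
        "task_id": context.get("task_id"),
        "run_id": context.get("run_id"),
        "logical_date": str(context.get("logical_date")),
        **business,  # e.g. rows_written, source_table
    }
    line = json.dumps(record)
    logger.info(line)
    return line
```

Inside a task, `context` would be the Airflow task context; here it is just a dict, which keeps the sketch testable anywhere.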
10. Alerting
- [ ] PagerDuty (or equivalent) for pageable incidents: task failure on a critical DAG, scheduler heartbeat stale, metadata DB unreachable.
- [ ] OpsGenie (or equivalent) for on-call routing.
- [ ] Slack for team-visible signals: DAG failure on a non-critical DAG, retry callbacks.
- [ ] Email alerts not used for paging; summary dashboards only.
- [ ] Alert fatigue review quarterly: which alerts fired, which were actionable, prune the rest.
Warning
Email alerts are the gateway drug to alert fatigue. Every team that starts with "Airflow email on failure" ends up with the channel muted inside six months. Use PagerDuty for the things that should wake someone up, Slack for the things people should see the next morning, and nothing for the things people do not care about.
11. Data quality
- [ ] Row-count assertions at the end of every meaningful write task.
- [ ] `expect_or_fail` or equivalent on silver pipelines (via LDP, dbt tests, or Astro Observe data quality).
- [ ] `AirflowFailException` raised on unmet invariants so the pipeline stops instead of publishing bad data.
- [ ] Downstream consumers alerted on data-quality events via Assets that signal "bad data" vs "new data".
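The row-count item, as a sketch — the guarded import lets it run without Airflow installed; the table name and threshold are hypothetical:

```python
try:
    from airflow.exceptions import AirflowFailException
except ImportError:
    class AirflowFailException(Exception):  # stand-in outside Airflow
        pass

def assert_row_count(actual: int, expected_min: int, table: str) -> int:
    # Fail the task outright (no retries) rather than publish a short table:
    # a deterministic data-quality violation will not fix itself on retry.
    if actual < expected_min:
        raise AirflowFailException(
            f"{table}: wrote {actual} rows, expected at least {expected_min}"
        )
    return actual
```

Calling this at the end of every meaningful write task is what turns "the load ran" into "the load produced plausible data".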
12. Runbooks
- [ ] Runbook for DAG failure triage (see failure triage guide).
- [ ] Runbook for re-running a single failed task vs full DAG backfill.
- [ ] Runbook for recovering from scheduler stall (heartbeat stale, DB unreachable).
- [ ] Runbook for emergency DAG pause / unpause during an incident.
- [ ] On-call knows where runbooks live without asking.
13. Testing
- [ ] Unit tests for task callables; `astro dev pytest` or `pytest src/ tests/` passes on every PR.
- [ ] DAG render tests: every DAG parses without errors in CI.
- [ ] Integration tests on staging before prod deploy, covering the golden path.
- [ ] Local Airflow via `astro dev start` matches prod exactly (same image, same providers).
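The parse-check item as a CI helper, sketched with a guarded import (without Airflow it degrades to a no-op so the snippet itself stays runnable; real CI always has Airflow installed):

```python
def dag_import_errors(dag_folder: str) -> dict:
    """Return {file: error} for every DAG file that fails to parse."""
    try:
        from airflow.models import DagBag
    except ImportError:
        return {}  # sketch only: real CI environments include Airflow
    bag = DagBag(dag_folder=dag_folder, include_examples=False)
    return dict(bag.import_errors)

# In CI (pytest):
#   def test_dags_parse():
#       assert dag_import_errors("dags/") == {}
```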
14. Documentation
- [ ] Every DAG has a description explaining its purpose, upstream sources, and downstream consumers.
- [ ] Every task in a critical DAG has a docstring or comment explaining what it does.
- [ ] Incident runbooks linked from the DAG description where applicable.
- [ ] Workspace README lists the set of DAGs, ownership, and on-call escalation path.
15. Security
- [ ] RBAC enabled (`rbac = True` in OSS; default on Astronomer).
- [ ] `expose_config = False` on the webserver, so airflow.cfg contents (which can embed secrets) stay out of the UI.
- [ ] Webserver behind authentication (Entra, Okta, etc.); no open Airflow UI on the internet.
- [ ] Worker pods run as non-root.
- [ ] Container images scanned for CVEs as part of the deploy pipeline.
16. Scale readiness
When the deployment crosses these thresholds, extra attention is warranted:
| DAG count | Action |
|---|---|
| 50 | Review parse times (`airflow dags report`); fix slow parsers. |
| 200 | Split into multiple workspaces by domain if parse contention is visible. |
| 500 | Astro Observe or OpenLineage for cross-DAG observability; manual correlation is impractical. |
| 1000 | Dedicated on-call, documented runbooks per domain, alerting per DAG. |
17. The promote-to-prod gate
Before a DAG (or a new workspace) is considered production-ready, a reviewer confirms each section above with a concrete artifact: a PR review note, a CI run link, a screenshot of an alert routing correctly, a runbook link. The gate is the checklist.
Deviations require an RFD and a dated waiver in the workspace README. Waivers expire on a fixed cadence (90 days typical).
Important
The supervisor model is the through-line of every section above. Airflow is most valuable when it is boring: it retries what it should, alerts on what it should, and lets the data engines do the data work. The teams that understand this spend their time building business value on top. The teams that do not spend their time fighting the scheduler.
See also
- DAG authoring standards — the per-DAG rules these readiness items rest on.
- The supervisor model — the philosophy that drives every decision here.
- Error recovery — the mechanisms section 5 requires.
- Failure triage — the procedure section 12 requires.