An Airflow deployment is production-ready when the on-call engineer can step away from it and nothing silently corrupts data in their absence. This checklist is the threshold.

1. Airflow version

See Airflow 3 changes for the migration reality if you are still on 2.x.

2. Platform

Danger

Production Airflow on a single VM with no backups is a data-loss incident waiting to happen. Airflow's metadata DB holds the full DAG run history, every variable, and every connection. If the instance dies and the DB dies with it, you cannot triage the last 90 days of incidents. Use a managed platform and a managed DB, and test the backups.

3. Deployment pipeline

4. DAG hygiene

See DAG authoring standards for per-DAG rules. Readiness requires:

5. Reliability

6. Dependencies

7. Concurrency

See concurrency reference.

8. Secrets

9. Observability

10. Alerting

Warning

Email alerts are the gateway drug to alert fatigue. Every team that starts with "Airflow email on failure" ends up with the channel muted inside six months. Use PagerDuty for the things that should wake someone up, Slack for the things people should see the next morning, and nothing for the things people do not care about.
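The three-way split described here can be sketched as a small routing table. This is a minimal illustration, not an Airflow API: the `Severity` names and the `route_alert` helper are hypothetical, and the destination strings stand in for whatever PagerDuty and Slack integrations the deployment actually uses.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    """Hypothetical severity levels, one per tier in the warning above."""
    PAGE = "page"      # should wake someone up
    NOTIFY = "notify"  # should be seen the next morning
    IGNORE = "ignore"  # nobody cares, so nothing is sent


def route_alert(severity: Severity) -> Optional[str]:
    """Return the destination for an alert, or None to drop it."""
    routes = {
        Severity.PAGE: "pagerduty",
        Severity.NOTIFY: "slack",
        Severity.IGNORE: None,
    }
    return routes[severity]
```

Note that email does not appear as a destination at all. Making "send nothing" an explicit route forces each alert to be classified rather than defaulting to a noisy channel.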

11. Data quality

12. Runbooks

13. Testing

14. Documentation

15. Security

16. Scale readiness

When the deployment crosses these thresholds, extra attention is warranted:

DAG count   Action
50          Review parse times (airflow dags report); fix slow parsers.
200         Split into multiple workspaces by domain if parse contention is visible.
500         Astro Observe or OpenLineage for cross-DAG observability; manual correlation is impractical.
1000        Dedicated on-call, documented runbooks per domain, alerting per DAG.
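For teams that want to surface these thresholds in a periodic report, a lookup along the following lines may help. It is a sketch under two assumptions that the table does not state: a threshold counts as crossed once the DAG count is equal to or greater than it, and the actions accumulate, so a 250-DAG deployment owes both the 50 and 200 actions. The `actions_for` helper is hypothetical.

```python
from bisect import bisect_right

# (DAG count threshold, action) pairs copied from the table above.
THRESHOLDS = [
    (50, "Review parse times (airflow dags report); fix slow parsers."),
    (200, "Split into multiple workspaces by domain if parse contention is visible."),
    (500, "Astro Observe or OpenLineage for cross-DAG observability; "
          "manual correlation is impractical."),
    (1000, "Dedicated on-call, documented runbooks per domain, alerting per DAG."),
]


def actions_for(dag_count: int) -> list[str]:
    """Return every action whose threshold the deployment has reached."""
    limits = [limit for limit, _ in THRESHOLDS]
    # bisect_right counts the thresholds that are <= dag_count.
    reached = bisect_right(limits, dag_count)
    return [action for _, action in THRESHOLDS[:reached]]
```

Calling `actions_for(250)` returns the first two actions, and any count below 50 returns an empty list.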

17. The promote-to-prod gate

Before a DAG (or a new workspace) is considered production-ready, a reviewer confirms each section above with a concrete artifact: a PR review note, a CI run link, a screenshot of an alert routing correctly, a runbook link. The gate is the checklist.
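One way to make the gate mechanical is to require an artifact reference per checklist section and refuse promotion while any is missing. The sketch below assumes the reviewer records artifacts as a plain mapping from section name to a link or note. The `missing_artifacts` and `gate_passes` helpers are hypothetical, and the section names are taken from the headings of this checklist.

```python
# Section names from checklist items 1 through 16.
SECTIONS = [
    "Airflow version", "Platform", "Deployment pipeline", "DAG hygiene",
    "Reliability", "Dependencies", "Concurrency", "Secrets",
    "Observability", "Alerting", "Data quality", "Runbooks",
    "Testing", "Documentation", "Security", "Scale readiness",
]


def missing_artifacts(artifacts: dict[str, str]) -> list[str]:
    """Return the sections that have no non-empty artifact reference."""
    return [s for s in SECTIONS if not artifacts.get(s, "").strip()]


def gate_passes(artifacts: dict[str, str]) -> bool:
    """The gate passes only when every section has an artifact."""
    return not missing_artifacts(artifacts)
```

A check like this only confirms that an artifact was recorded. Judging whether the artifact actually demonstrates the requirement remains the reviewer's job.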

Deviations require an RFD and a dated waiver in the workspace README. Waivers expire on a fixed cadence (typically 90 days).
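Because waivers are dated, expiry can be checked automatically, for example in CI. The following is a minimal sketch that assumes a 90-day lifetime and treats the waiver as expired on the 90th day after it was granted. The `waiver_expired` helper is hypothetical, and how the date is parsed out of the README is left open.

```python
from datetime import date, timedelta

# Assumption: the "typical" 90-day cadence; adjust per workspace policy.
WAIVER_LIFETIME = timedelta(days=90)


def waiver_expired(
    granted_on: date,
    today: date,
    lifetime: timedelta = WAIVER_LIFETIME,
) -> bool:
    """Return True once `lifetime` has fully elapsed since the grant date."""
    return today >= granted_on + lifetime
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check deterministic and easy to test.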

Important

The supervisor model is the through-line of every section above. Airflow is most valuable when it is boring: it retries what it should, alerts on what it should, and lets the data engines do the data work. The teams that understand this spend their time building business value on top. The teams that do not understand it spend their time fighting the scheduler.

See also