These standards bind every DAG deployed to a Causeway Airflow environment. They are not recommendations. Exceptions require a documented waiver in the PR.
1. DAG shape
- Small, focused DAGs. Five 20-task DAGs beat one 100-task DAG.
- Each task does one unit of recoverable work. Atomicity: if the task succeeds, the work is done; if it fails, the work is undone or clearly marked as incomplete.
- TaskGroups for visual grouping within a DAG that exceeds 10 tasks. Not a substitute for splitting cross-domain DAGs.
- Tags on every DAG: `team:<team>`, `domain:<domain>`, `criticality:<tier>`.
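The tagging convention can be sketched as follows, assuming Airflow 3's `airflow.sdk` authoring interface (on Airflow 2, the decorator lives in `airflow.decorators`); the tag values are illustrative:

```python
from airflow.sdk import dag

@dag(
    schedule=None,
    tags=["team:data-eng", "domain:orders", "criticality:tier1"],  # hypothetical values
)
def orders_intake():
    ...

orders_intake()
```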
2. Idempotency (non-negotiable)
- Writes are `MERGE`, `UPSERT`, or "delete-partition-then-insert". No blind `INSERT`.
- `logical_date` is the partition key. Every write scopes to a deterministic window derived from `logical_date`.
- No `datetime.now()` or wall-clock time in task logic. It silently breaks backfills.
- External side effects carry idempotency keys derived from `dag_id + task_id + logical_date`.
Danger
`datetime.now()` in task code is banned outright. A backfill that runs the task for last Tuesday but writes with "today's" date corrupts data in ways that take months to surface. If you need "right now" semantics, call `pendulum.now()` only at the task boundary (start of execution); never embed it in the payload written to storage.
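A minimal sketch of the idempotency-key rule, using only the stdlib (the function name and key length are illustrative choices, not a prescribed API):

```python
import hashlib

def idempotency_key(dag_id: str, task_id: str, logical_date_iso: str) -> str:
    # Same (dag_id, task_id, logical_date) always yields the same key,
    # so a backfill re-run emits an identical key and the downstream
    # system can deduplicate the side effect.
    raw = f"{dag_id}:{task_id}:{logical_date_iso}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

key = idempotency_key("orders_hourly", "post_invoice", "2025-06-03T13:00:00+00:00")
```

Because the key is a pure function of run identity, nothing wall-clock-dependent leaks into the payload.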
3. start_date, schedule, catchup
- pendulum-timezoned `start_date`; never a naive `datetime`.
- Schedule in UTC for machine work; translate to local time in reports.
- `catchup=False`, always. Backfills are explicit (`airflow dags backfill`), not implicit.
- `max_active_runs=1` for stateful DAGs (any DAG writing to `logical_date`-scoped partitions).
- `schedule=None` for manually triggered workflows; do not fake it with a far-future cron.
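Taken together, the scheduling rules look like this in a DAG definition (a sketch assuming Airflow 3's `airflow.sdk` interface; DAG name and dates are placeholders):

```python
import pendulum
from airflow.sdk import dag

@dag(
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),  # timezoned, never naive
    schedule="@hourly",       # UTC for machine work
    catchup=False,            # backfills are explicit, never implicit
    max_active_runs=1,        # stateful: writes logical_date-scoped partitions
)
def orders_hourly():
    ...

orders_hourly()
```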
4. Dependencies
- Direct `>>` for cohesive intra-DAG flow.
- Assets for DAG-to-DAG coordination inside Airflow.
- AssetWatchers for outside-world triggers.
- Never use `ExternalTaskSensor` in new DAGs; use Assets.
- Never use synchronous sensors with `deferrable=False`; use deferrable variants or AssetWatchers.
- Never use `TriggerDagRunOperator` in new DAGs; use Assets.
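Asset-based coordination replaces both `ExternalTaskSensor` and `TriggerDagRunOperator` with one declarative link. A sketch assuming Airflow 3's `airflow.sdk` interface (the Asset URI and DAG names are placeholders):

```python
from airflow.sdk import Asset, dag, task

orders_table = Asset("s3://lake/orders/")  # hypothetical URI

@dag(schedule=None)
def orders_producer():
    @task(outlets=[orders_table])  # marks the Asset updated on success
    def load_orders():
        ...
    load_orders()

@dag(schedule=[orders_table])  # runs whenever the producer updates the Asset
def orders_consumer():
    ...

orders_producer()
orders_consumer()
```

The consumer never polls; the scheduler wires the dependency from the Asset declaration alone.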
See dependency types.
5. Retries
```python
from datetime import timedelta

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(hours=1),
    "on_failure_callback": alert_pagerduty,  # team-provided alerting hook
}
```
- Retries on every task; never zero.
- Exponential backoff with a cap. No fixed-delay retries.
- `AirflowFailException` for deterministic failures; it skips the remaining retries.
- Callbacks to real alerting (PagerDuty, Slack); email alerts are deprecated.
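The deterministic-failure rule in practice: raise `AirflowFailException` when retrying cannot possibly help, and let ordinary exceptions take the retry path. A sketch (the `client` object and status codes are illustrative assumptions):

```python
from airflow.exceptions import AirflowFailException
from airflow.sdk import task

@task(retries=3)
def post_invoice(payload: dict):
    resp = client.post("/invoices", json=payload)  # hypothetical HTTP client
    if resp.status_code == 422:
        # Deterministic: an invalid payload will fail identically on every
        # retry, so fail fast and skip the remaining attempts.
        raise AirflowFailException(f"invalid payload: {resp.text}")
    resp.raise_for_status()  # transient 5xx raises -> normal retry path
```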
6. Sensors
- Deferrable only: `*Async` operators or `deferrable=True`. Synchronous sensors are banned.
- Timeout always set: 2× expected upstream latency; never `timeout=None`.
- `poke_interval` ≥ 30 s unless the upstream system specifically supports higher frequency.
- Prefer AssetWatchers when the trigger is truly external.
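A compliant sensor might look like this, assuming the Amazon provider's `S3KeySensor` (bucket path is a placeholder; the 2-hour timeout assumes roughly 1 hour of expected upstream latency):

```python
from datetime import timedelta
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_export = S3KeySensor(
    task_id="wait_for_export",
    bucket_key="s3://vendor-drop/exports/{{ ds }}/done",     # hypothetical path
    deferrable=True,                                          # frees the worker slot while waiting
    poke_interval=60,                                         # >= 30 s floor
    timeout=timedelta(hours=2).total_seconds(),               # ~2x expected upstream latency
)
```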
7. Pools
- Pools for every rate-limited external resource (Salesforce API, Snowflake warehouse, test environments).
- Pool name matches the resource (`salesforce_api`, `wh-elt`), not the consumer DAG.
- Slots match the real rate limit, not a guess.
- Created via a seeding DAG or IaC, not ad-hoc in the UI.
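Referencing a pool from a task is a one-line declaration; a sketch assuming the pool was pre-created via IaC (task name is illustrative):

```python
from airflow.sdk import task

@task(pool="salesforce_api")  # pool named for the resource, not this DAG
def sync_accounts():
    # Every task across every DAG that hits the Salesforce API declares the
    # same pool, so the scheduler enforces the global concurrency limit.
    ...
```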
8. Task callables
- Airflow is a supervisor, not an engine. Heavy work happens in Databricks, dbt, a warehouse, or a dedicated compute system. Not in the Airflow worker.
- XCom carries metadata, not data bodies. URIs, counts, watermarks, status. If you are passing MB of data through XCom, push it to object storage and pass the key.
- Task callables are short and testable. If the callable is 300 lines, split it.
- No top-level work in DAG files. All imports beyond the stdlib live inside task callables. API calls, DB queries, and file reads at DAG-file top level run on every parse cycle, constantly.
Warning
The single most common Airflow performance regression is a DAG file that imports pandas or boto3 at module scope. Every scheduler parse cycle (every 30 seconds by default) pays the import cost. Multiply by 100 DAGs and the scheduler spends most of its time parsing, not scheduling. Imports inside task callables are late-bound to actual task execution and cost nothing at parse time.
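The late-import pattern can be demonstrated with the stdlib alone; here `json` stands in for a heavy library like pandas or boto3 (function name and payload are illustrative):

```python
from datetime import timedelta  # cheap stdlib import: fine at module scope

def count_records(payload: str) -> int:
    # Heavy import deferred to execution time: the scheduler's parse loop
    # only imports the DAG module, so it never pays this cost.
    import json  # stand-in for pandas / boto3
    return len(json.loads(payload))
```

At parse time the module body costs almost nothing; the `import json` line executes only when a worker actually runs the task.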
9. Error handling
- Retries for transient failures. Network blips, 503s, rate-limit 429s.
- Pools for rate limits. Do not retry your way through a rate limit.
- Sensors for not-ready-yet. Do not retry your way through upstream delays.
- Callbacks for alerting: `on_failure_callback` → PagerDuty; `on_sla_miss_callback` for SLA-equivalent monitoring (now via Astro Observe or equivalent).
- `AirflowFailException` for poison messages. Fail fast on deterministic errors.
See error recovery.
10. Environment
- Three environments: dev, staging, prod. One Airflow workspace per environment.
- Connections from a secret backend (AWS Secrets Manager, Vault, Astronomer deployment variables). No connections stored in metadata DB with production secrets.
- Variables from the same backend where they are sensitive; metadata DB is fine for config values.
- Per-environment variable values via deployment variables, not `vars:` in DAG files.
11. Versioning
- DAG versioning enabled (Airflow 3 default). Every run pinned to the DAG code version.
- Breaking changes rename the DAG: `orders_hourly_v2` alongside `orders_hourly`; let the old one run out its history.
- Asset schemas are breaking changes. Consumers must adopt the new schema before producers change it.
- Provider versions pinned explicitly; no floating `latest`.
12. Deployment
- `astro deploy` (or an equivalent CI path) is the only production deploy mechanism. No manual UI-driven DAG uploads.
- CI runs on every PR: parse check, unit tests on callables, BPA-equivalent linting (ruff / mypy), DAG render tests.
- Deploy to dev on every PR; run smoke tests.
- Merge to main deploys to staging; integration tests gate prod deploy.
- OIDC authentication from CI to Astro; no long-lived tokens.
13. Observability
- Structured JSON logs with `dag_id`, `task_id`, `run_id`, `logical_date`, and business context.
- Logs shipped to a central platform (Datadog, Grafana Loki, Splunk, Astro Observe).
- OpenLineage events emitted from DAGs that produce Assets.
- Metrics exported to Prometheus or equivalent; scheduler heartbeat, task duration, queue depth.
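One way to emit the required structured fields is a custom `logging.Formatter`; a minimal sketch using the stdlib (the field set mirrors the rule above, and the context values would come from the Airflow runtime in a real task):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying run context."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "message": record.getMessage(),
            "level": record.levelname,
            # Context fields arrive via `extra=...`; absent fields stay null.
            "dag_id": getattr(record, "dag_id", None),
            "task_id": getattr(record, "task_id", None),
            "run_id": getattr(record, "run_id", None),
            "logical_date": getattr(record, "logical_date", None),
        })

logger = logging.getLogger("causeway")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.warning("rows_loaded=%d", 1200,
               extra={"dag_id": "orders_hourly", "task_id": "load"})
```

Each line is then machine-parseable by the central log platform without regex scraping.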
14. Databricks integration (if applicable)
- `DatabricksRunNowOperator` against DAB-deployed job IDs. Never `DatabricksSubmitRunOperator` in prod (two sources of truth for job definitions).
- `deferrable=True` always; the Databricks job runs for minutes, and the worker slot must not be held for the duration.
- Cosmos with `LoadMode.DBT_MANIFEST` for dbt. `DBT_LS` is banned (slow scheduler parse).
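The operator rule reduces to a short declaration; a sketch assuming the Databricks provider (connection ID and job ID are placeholders for the DAB-deployed job):

```python
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

run_orders_job = DatabricksRunNowOperator(
    task_id="run_orders_job",
    databricks_conn_id="databricks_prod",  # hypothetical connection name
    job_id=1234,                           # placeholder: DAB-deployed job ID
    deferrable=True,                       # worker slot released while the job runs
)
```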
See the dbt + Cosmos guide for the full Cosmos config.
15. Unstructured data
- File URIs in XCom, not file bodies.
- UC Volumes on Databricks for governed storage.
- Dynamic task mapping for per-file fan-out; bounded by `max_active_tis_per_dagrun`.
- GPU work offloaded to Databricks GPU clusters, Modal, or Bedrock; Airflow triggers and waits.
- File-level idempotency via URI + content hash.
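A bounded per-file fan-out might be sketched as follows, assuming Airflow 3's `airflow.sdk` interface (URIs and the fan-out bound are placeholders):

```python
from airflow.sdk import dag, task

@dag(schedule=None)
def process_files():
    @task
    def list_uris() -> list[str]:
        # Placeholder listing: in practice, an object-store list call.
        return ["s3://lake/raw/a.parquet", "s3://lake/raw/b.parquet"]

    @task(max_active_tis_per_dagrun=8)  # bound the fan-out per run
    def process(uri: str):
        # Only the URI travels through XCom, never the file body.
        ...

    process.expand(uri=list_uris())

process_files()
```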
16. Review checklist
PRs touching DAG code must satisfy:
- [ ] pendulum-timezoned `start_date`; `catchup=False`; `max_active_runs=1` for stateful DAGs.
- [ ] Retries with exponential backoff and a cap.
- [ ] `on_failure_callback` wired to real alerting.
- [ ] Deferrable sensors only; timeouts set.
- [ ] Pools referenced for every rate-limited external resource.
- [ ] No top-level heavy imports in DAG files.
- [ ] Idempotent writes; `logical_date` as partition key.
- [ ] Assets for cross-DAG coordination; no `ExternalTaskSensor` / `TriggerDagRunOperator`.
- [ ] Tags on the DAG.
- [ ] Unit tests for task callables; DAG render tests.
- [ ] CI parse check passes.
See also
- Production readiness — what a DAG needs before first prod run.
- DAG authoring guide — worked example applying these rules.
- Error recovery — the mechanisms these rules invoke.