These standards bind every DAG deployed to a Causeway Airflow environment. They are not recommendations. Exceptions require a documented waiver in the PR.
1. DAG shape
- Small, focused DAGs. Five 20-task DAGs beat one 100-task DAG.
- Each task does one unit of recoverable work. Atomicity: if the task succeeds, the work is done; if it fails, the work is undone or clearly marked as incomplete.
- TaskGroups for visual grouping within a DAG that exceeds 10 tasks. Not a substitute for splitting cross-domain DAGs.
- Tags on every DAG: `team:<team>`, `domain:<domain>`, `criticality:<tier>`.
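The tagging convention can be sketched as follows, assuming Airflow 3's `airflow.sdk` authoring interface (on Airflow 2, the decorator lives in `airflow.decorators`); the tag values are illustrative:

```python
from airflow.sdk import dag

@dag(
    schedule=None,
    tags=["team:data-eng", "domain:orders", "criticality:tier1"],  # hypothetical values
)
def orders_intake():
    ...

orders_intake()
```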
2. Idempotency (non-negotiable)
- Writes are `MERGE`, `UPSERT`, or "delete-partition-then-insert". No blind `INSERT`.
- `logical_date` is the partition key. Every write scopes to a deterministic window derived from `logical_date`.
- No `datetime.now()` or wall-clock time in task logic. It silently breaks backfills.
- External side effects carry idempotency keys derived from `dag_id + task_id + logical_date`.
Danger
`datetime.now()` in task code is banned outright. A backfill that runs the task for last Tuesday but writes with "today's" date corrupts data in ways that take months to surface. If you need "right now" semantics, call `pendulum.now()` only at the task boundary (start of execution); never embed it in the payload written to storage.
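A minimal sketch of the idempotency-key rule, using only the stdlib (the function name and key length are illustrative choices, not a prescribed API):

```python
import hashlib

def idempotency_key(dag_id: str, task_id: str, logical_date_iso: str) -> str:
    # Same (dag_id, task_id, logical_date) always yields the same key,
    # so a backfill re-run emits an identical key and the downstream
    # system can deduplicate the side effect.
    raw = f"{dag_id}:{task_id}:{logical_date_iso}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

key = idempotency_key("orders_hourly", "post_invoice", "2025-06-03T13:00:00+00:00")
```

Because the key is a pure function of run identity, nothing wall-clock-dependent leaks into the payload.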
3. start_date, schedule, catchup
- pendulum-timezoned `start_date`; never a naive `datetime`.
- Schedule in UTC for machine work; translate to local time in reports.
- `catchup=False`, always. Backfills are explicit (`airflow dags backfill`), not implicit.
- `max_active_runs=1` for stateful DAGs (any DAG writing to `logical_date`-scoped partitions).
- `schedule=None` for manually triggered workflows; do not fake it with a far-future cron.
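Taken together, the scheduling rules look like this in a DAG definition (a sketch assuming Airflow 3's `airflow.sdk` interface; DAG name and dates are placeholders):

```python
import pendulum
from airflow.sdk import dag

@dag(
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),  # timezoned, never naive
    schedule="@hourly",       # UTC for machine work
    catchup=False,            # backfills are explicit, never implicit
    max_active_runs=1,        # stateful: writes logical_date-scoped partitions
)
def orders_hourly():
    ...

orders_hourly()
```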
4. Dependencies
- Direct `>>` for cohesive intra-DAG flow.
- Assets for DAG-to-DAG coordination inside Airflow.
- AssetWatchers for outside-world triggers.
- Never use `ExternalTaskSensor` in new DAGs; use Assets.
- Never use synchronous sensors with `deferrable=False`; use deferrable variants or AssetWatchers.
- Never use `TriggerDagRunOperator` in new DAGs; use Assets.
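Asset-based coordination replaces both `ExternalTaskSensor` and `TriggerDagRunOperator` with one declarative link. A sketch assuming Airflow 3's `airflow.sdk` interface (the Asset URI and DAG names are placeholders):

```python
from airflow.sdk import Asset, dag, task

orders_table = Asset("s3://lake/orders/")  # hypothetical URI

@dag(schedule=None)
def orders_producer():
    @task(outlets=[orders_table])  # marks the Asset updated on success
    def load_orders():
        ...
    load_orders()

@dag(schedule=[orders_table])  # runs whenever the producer updates the Asset
def orders_consumer():
    ...

orders_producer()
orders_consumer()
```

The consumer never polls; the scheduler wires the dependency from the Asset declaration alone.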
See dependency types.
5. Retries
```python
from datetime import timedelta

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(hours=1),
    "on_failure_callback": alert_pagerduty,  # team-provided alerting hook
}
```
- Retries on every task; never zero.
- Exponential backoff with a cap. No fixed-delay retries.
- `AirflowFailException` for deterministic failures; it skips the remaining retries.
- Callbacks to real alerting (PagerDuty, Slack); email alerts are deprecated.
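The deterministic-failure rule in practice: raise `AirflowFailException` when retrying cannot possibly help, and let ordinary exceptions take the retry path. A sketch (the `client` object and status codes are illustrative assumptions):

```python
from airflow.exceptions import AirflowFailException
from airflow.sdk import task

@task(retries=3)
def post_invoice(payload: dict):
    resp = client.post("/invoices", json=payload)  # hypothetical HTTP client
    if resp.status_code == 422:
        # Deterministic: an invalid payload will fail identically on every
        # retry, so fail fast and skip the remaining attempts.
        raise AirflowFailException(f"invalid payload: {resp.text}")
    resp.raise_for_status()  # transient 5xx raises -> normal retry path
```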
6. Sensors
- Deferrable only: `*Async` operators or `deferrable=True`. Synchronous sensors are banned.
- Timeout always set: 2× expected upstream latency; never `timeout=None`.
- `poke_interval` ≥ 30 s unless the upstream system specifically supports higher frequency.
- Prefer AssetWatchers when the trigger is truly external.
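A compliant sensor might look like this, assuming the Amazon provider's `S3KeySensor` (bucket path is a placeholder; the 2-hour timeout assumes roughly 1 hour of expected upstream latency):

```python
from datetime import timedelta
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_export = S3KeySensor(
    task_id="wait_for_export",
    bucket_key="s3://vendor-drop/exports/{{ ds }}/done",     # hypothetical path
    deferrable=True,                                          # frees the worker slot while waiting
    poke_interval=60,                                         # >= 30 s floor
    timeout=timedelta(hours=2).total_seconds(),               # ~2x expected upstream latency
)
```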
7. Pools
- Pools for every rate-limited external resource (Salesforce API, Snowflake warehouse, test environments).
- Pool name matches the resource (`salesforce_api`, `wh-elt`), not the consumer DAG.
- Slots match the real rate limit, not a guess.
- Created via a seeding DAG or IaC, not ad-hoc in the UI.
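Referencing a pool from a task is a one-line declaration; a sketch assuming the pool was pre-created via IaC (task name is illustrative):

```python
from airflow.sdk import task

@task(pool="salesforce_api")  # pool named for the resource, not this DAG
def sync_accounts():
    # Every task across every DAG that hits the Salesforce API declares the
    # same pool, so the scheduler enforces the global concurrency limit.
    ...
```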
8. Task callables
- Airflow is a supervisor, not an engine. Heavy work happens in Databricks, dbt, a warehouse, or a dedicated compute system. Not in the Airflow worker.
- XCom carries metadata, not data bodies. URIs, counts, watermarks, status. If you are passing MB of data through XCom, push it to object storage and pass the key.
- Task callables are short and testable. If the callable is 300 lines, split it.
- No top-level work in DAG files. All imports beyond the stdlib live inside task callables. API calls, DB queries, and file reads at DAG-file top level run on every parse cycle, constantly.
Warning
The single most common Airflow performance regression is a DAG file that imports pandas or boto3 at module scope. Every scheduler parse cycle (every 30 seconds by default) pays the import cost. Multiply by 100 DAGs and the scheduler spends most of its time parsing, not scheduling. Imports inside task callables are late-bound to actual task execution and cost nothing at parse time.
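The late-import pattern can be demonstrated with the stdlib alone; here `json` stands in for a heavy library like pandas or boto3 (function name and payload are illustrative):

```python
from datetime import timedelta  # cheap stdlib import: fine at module scope

def count_records(payload: str) -> int:
    # Heavy import deferred to execution time: the scheduler's parse loop
    # only imports the DAG module, so it never pays this cost.
    import json  # stand-in for pandas / boto3
    return len(json.loads(payload))
```

At parse time the module body costs almost nothing; the `import json` line executes only when a worker actually runs the task.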
9. Error handling
- Retries for transient failures. Network blips, 503s, rate-limit 429s.
- Pools for rate limits. Do not retry your way through a rate limit.
- Sensors for not-ready-yet. Do not retry your way through upstream delays.
- Callbacks for alerting: `on_failure_callback` → PagerDuty; `on_sla_miss_callback` for SLA-equivalent monitoring (now via Astro Observe or equivalent).
- `AirflowFailException` for poison messages. Fail fast on deterministic errors.
See error recovery.
10. Environment
- Three environments: dev, staging, prod. One Airflow workspace per environment.
- Connections from a secret backend (AWS Secrets Manager, Vault, Astronomer deployment variables). No connections stored in metadata DB with production secrets.
- Variables from the same backend where they are sensitive; metadata DB is fine for config values.
- Per-environment variable values via deployment variables, not `vars:` in DAG files.
11. Versioning
- DAG versioning enabled (Airflow 3 default). Every run pinned to the DAG code version.
- Breaking changes rename the DAG: `orders_hourly_v2` alongside `orders_hourly`; let the old one run out its history.
- Asset schemas are breaking changes. Consumers must adopt the new schema before producers change it.
- Provider versions pinned explicitly; no floating `latest`.
12. Deployment
- `astro deploy` (or an equivalent CI path) is the only production deploy mechanism. No manual UI-driven DAG uploads.
- CI runs on every PR: parse check, unit tests on callables, BPA-equivalent linting (ruff / mypy), DAG render tests.
- Deploy to dev on every PR; run smoke tests.
- Merge to main deploys to staging; integration tests gate prod deploy.
- OIDC authentication from CI to Astro; no long-lived tokens.
13. Observability
- Structured JSON logs with `dag_id`, `task_id`, `run_id`, `logical_date`, and business context.
- Logs shipped to a central platform (Datadog, Grafana Loki, Splunk, Astro Observe).
- OpenLineage events emitted from DAGs that produce Assets.
- Metrics exported to Prometheus or equivalent; scheduler heartbeat, task duration, queue depth.
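One way to emit the required structured fields is a custom `logging.Formatter`; a minimal sketch using the stdlib (the field set mirrors the rule above, and the context values would come from the Airflow runtime in a real task):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying run context."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "message": record.getMessage(),
            "level": record.levelname,
            # Context fields arrive via `extra=...`; absent fields stay null.
            "dag_id": getattr(record, "dag_id", None),
            "task_id": getattr(record, "task_id", None),
            "run_id": getattr(record, "run_id", None),
            "logical_date": getattr(record, "logical_date", None),
        })

logger = logging.getLogger("causeway")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.warning("rows_loaded=%d", 1200,
               extra={"dag_id": "orders_hourly", "task_id": "load"})
```

Each line is then machine-parseable by the central log platform without regex scraping.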
14. Databricks integration (if applicable)
- `DatabricksRunNowOperator` against DAB-deployed job IDs. Never `DatabricksSubmitRunOperator` in prod (two sources of truth for job definitions).
- `deferrable=True` always; the Databricks job runs for minutes, and the worker slot must not be held for the duration.
- Cosmos with `LoadMode.DBT_MANIFEST` for dbt. `DBT_LS` is banned (slow scheduler parse).
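The operator rule reduces to a short declaration; a sketch assuming the Databricks provider (connection ID and job ID are placeholders for the DAB-deployed job):

```python
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

run_orders_job = DatabricksRunNowOperator(
    task_id="run_orders_job",
    databricks_conn_id="databricks_prod",  # hypothetical connection name
    job_id=1234,                           # placeholder: DAB-deployed job ID
    deferrable=True,                       # worker slot released while the job runs
)
```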
See the dbt + Cosmos guide for the full Cosmos config.
15. Unstructured data
- File URIs in XCom, not file bodies.
- UC Volumes on Databricks for governed storage.
- Dynamic task mapping for per-file fan-out; bounded by `max_active_tis_per_dagrun`.
- GPU work offloaded to Databricks GPU clusters, Modal, or Bedrock; Airflow triggers and waits.
- File-level idempotency via URI + content hash.
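A bounded per-file fan-out might be sketched as follows, assuming Airflow 3's `airflow.sdk` interface (URIs and the fan-out bound are placeholders):

```python
from airflow.sdk import dag, task

@dag(schedule=None)
def process_files():
    @task
    def list_uris() -> list[str]:
        # Placeholder listing: in practice, an object-store list call.
        return ["s3://lake/raw/a.parquet", "s3://lake/raw/b.parquet"]

    @task(max_active_tis_per_dagrun=8)  # bound the fan-out per run
    def process(uri: str):
        # Only the URI travels through XCom, never the file body.
        ...

    process.expand(uri=list_uris())

process_files()
```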
16. Review checklist
PRs touching DAG code must satisfy:
- [ ] pendulum-timezoned `start_date`; `catchup=False`; `max_active_runs=1` for stateful DAGs.
- [ ] Retries with exponential backoff and a cap.
- [ ] `on_failure_callback` wired to real alerting.
- [ ] Deferrable sensors only; timeouts set.
- [ ] Pools referenced for every rate-limited external resource.
- [ ] No top-level heavy imports in DAG files.
- [ ] Idempotent writes; `logical_date` as partition key.
- [ ] Assets for cross-DAG coordination; no `ExternalTaskSensor` / `TriggerDagRunOperator`.
- [ ] Tags on the DAG.
- [ ] Unit tests for task callables; DAG render tests.
- [ ] CI parse check passes.
See also
- Production readiness — what a DAG needs before first prod run.
- DAG authoring guide — worked example applying these rules.
- Error recovery — the mechanisms these rules invoke.