The most expensive mistake teams make with Airflow is using it as a compute engine. Pandas transforms inside PythonOperator. Business logic in task callables. 5 GB data bodies flying through XCom. At scale, every one of those choices becomes an operational tax.
Airflow's job is to supervise work that happens elsewhere. The work itself runs on the system built to run it.
The five things a supervisor does
- Decides when a unit of work should run — schedule, event, asset update, or manual trigger.
- Triggers the right external system — Databricks, dbt, an API, a Lambda.
- Tracks success, failure, and retry state — the metadata database is the authoritative source of "did this run".
- Coordinates dependencies across systems — DAG A, then B, then C, where each runs on different compute.
- Emits metadata downstream — lineage, run results, SLA signals, observability events.
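The five responsibilities above reduce to a small loop: trigger, poll, record, emit. A minimal sketch in plain Python (no Airflow imports; `trigger_job`, `poll_status`, and `emit_metadata` are hypothetical stand-ins for what a provider hook and a callback would do):

```python
import time

def supervise(trigger_job, poll_status, emit_metadata, poll_interval=30):
    """Trigger work on an external system, track it to completion,
    and emit the result downstream -- the whole supervisor surface."""
    run_id = trigger_job()                    # trigger the right external system
    while True:
        status = poll_status(run_id)          # track success/failure state
        if status in ("SUCCESS", "FAILED"):
            break
        time.sleep(poll_interval)
    emit_metadata({"run_id": run_id, "status": status})  # emit metadata downstream
    return status
```

Note that the loop holds only a run ID and a status string, never the data itself; a deferrable operator replaces the sleep with an async trigger so the worker slot is released while waiting.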
That is the whole surface. Everything else should happen elsewhere.
The engine anti-pattern
Compare two implementations of the same pipeline.
The anti-pattern: work inside the worker
```python
from airflow.decorators import task

@task
def transform_orders():
    import pandas as pd

    df = pd.read_parquet("s3://bucket/orders_raw.parquet")  # 5 GB of data
    df["revenue"] = df["amount"] * df["fx_rate"]
    df["is_active"] = df["status"] == "active"
    df.to_parquet("s3://bucket/orders_enriched.parquet")
```
What is wrong:
- The worker spends 45 minutes holding 5 GB of pandas DataFrame in memory.
- The worker slot is unavailable for other tasks the whole time.
- If the worker dies mid-task, the whole 45 minutes of work is lost.
- Scaling worker memory to fit the largest pandas transform means paying for that memory on every tiny task.
- Logging, monitoring, and retries for this transform all live in Airflow instead of in the system that should own them.
The right pattern: supervise, don't execute
```python
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

transform = DatabricksRunNowOperator(
    task_id="transform_orders",
    databricks_conn_id="databricks_default",
    job_id="{{ var.value.transform_orders_job_id }}",
    deferrable=True,  # release the worker slot while Databricks runs
)
```
The 5 GB transform runs on a Databricks cluster sized for it. Airflow holds no memory. The task is deferrable, so the worker slot is free for other work while Databricks does the heavy lifting. If the task fails, Airflow retries; if the Databricks job fails, Databricks's own retry logic kicks in first.
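The two retry layers compose: the engine retries its own transient failures first, and the supervisor retries only when the whole submission fails. A hedged sketch of that layering in plain Python (the function names and retry counts are illustrative, not an Airflow API):

```python
def run_with_engine_retries(job, engine_retries=2):
    """Inner layer: the engine (e.g. Databricks) retries transient failures itself."""
    for attempt in range(engine_retries + 1):
        try:
            return job()
        except RuntimeError:             # engine-level transient failure
            if attempt == engine_retries:
                raise                    # exhausted: surface to the supervisor

def supervise_with_retries(job, task_retries=1):
    """Outer layer: the supervisor (Airflow) retries the whole submission."""
    for attempt in range(task_retries + 1):
        try:
            return run_with_engine_retries(job)
        except RuntimeError:
            if attempt == task_retries:
                raise
```

The point of the layering is ownership: the engine handles failures it understands (spot instance loss, cluster startup), and Airflow only re-submits when the engine gives up.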
Important
This distinction drives every other decision in Airflow. Deferrable operators, pools, Assets, concurrency knobs, retry backoff: all of them make sense only once you accept that Airflow is supervising work, not doing it. Teams that treat Airflow as a compute engine spend the bulk of their time fighting the scheduler. Teams that treat it as a supervisor spend their time building pipelines.
The rule of thumb
If a task takes 45 minutes and processes 5 GB of data inside the worker, you have built an ETL engine out of a scheduler. Stop; find the right engine.
Right engines, common cases:
| What the task does | Right engine |
|---|---|
| Transforms on lakehouse data | Databricks job (via DatabricksRunNowOperator) |
| SQL on a warehouse | Warehouse job or dbt model, triggered from Airflow |
| Python batch on small data | Still Airflow, but tight and short |
| ML training | Dedicated GPU compute (Modal, SageMaker, Databricks GPU cluster) |
| API call with retries | Airflow, but deferrable with an async operator |
| File processing at scale | Databricks, Glue, or a Lambda fleet; Airflow triggers and waits |
The "what goes in Airflow" test
Three questions for any logic you are considering putting in a DAG file or task callable:
- Is this decision logic (when to run, what to run, what came before)? → Airflow.
- Is this metadata emission (writing lineage, logging business signal, alerting)? → Airflow.
- Is this computational work on non-trivial data (joins, aggregations, transforms)? → Not Airflow. Trigger the system that owns that work and wait on completion.
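The three questions can be encoded as a routing helper. A sketch (the category names are an assumption for illustration, not an Airflow feature):

```python
def route(kind: str) -> str:
    """Apply the three-question test to a unit of logic.

    kind -- one of:
      'decision'  when/what to run, upstream state checks
      'metadata'  lineage, business-signal logging, alerting
      'compute'   joins, aggregations, transforms on non-trivial data
    """
    if kind in ("decision", "metadata"):
        return "airflow"
    if kind == "compute":
        return "external engine"   # trigger the owning system and wait on completion
    raise ValueError(f"unknown kind: {kind}")
```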
When in doubt, err on the side of pushing work out. Airflow's value is consistency as an orchestrator. Taking on compute responsibilities erodes that consistency.
A legitimate small exception
Not every piece of work deserves its own cluster. A 30-second Python script that touches a few hundred rows and writes a status record is fine inside a task. The rule is "one unit of recoverable work per task," not "never touch data in Python".
The heuristic: if the task runtime dominates the scheduler orchestration overhead, find the right engine. If the task is ten seconds of trivial glue, keep it in Airflow.
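The heuristic can be made concrete with a rough threshold (the ten-second figure is an illustrative assumption about scheduler overhead, not an Airflow default):

```python
def keep_in_airflow(runtime_seconds: float, overhead_seconds: float = 10.0) -> bool:
    """If the task's own runtime dominates orchestration overhead, the work
    belongs on a real engine; trivial glue stays in Airflow."""
    return runtime_seconds <= overhead_seconds
```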
What this means for DAG design
Once you internalize the supervisor model, DAG design becomes mostly shape:
- Each task triggers one external action and waits for it.
- Dependencies express order, not data flow. XComs carry metadata (run IDs, watermarks, status), not data bodies.
- Retries are transient-failure handling, not "wait longer for the data to appear". That is what sensors are for.
- Pools gate concurrency against external resources (rate-limited APIs, warehouse queue slots).
- Callbacks emit signals downstream (PagerDuty, Slack, monitoring) on failure.
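The "XComs carry metadata, not data bodies" rule above is easy to check mechanically. A sketch of a metadata-only payload with a guard against pushing data through it (the size limit is an illustrative assumption; real XCom backends have their own limits):

```python
import json

XCOM_SOFT_LIMIT_BYTES = 48 * 1024   # illustrative threshold, not an Airflow default

def make_xcom_payload(run_id: str, watermark: str, status: str) -> dict:
    """Metadata about the work, never the work's data body."""
    payload = {"run_id": run_id, "watermark": watermark, "status": status}
    if len(json.dumps(payload).encode()) > XCOM_SOFT_LIMIT_BYTES:
        raise ValueError("XCom payload looks like a data body, not metadata")
    return payload
```

A payload like `{"run_id": "run-42", "watermark": "2024-06-01", "status": "SUCCESS"}` is a few dozen bytes; a serialized DataFrame is not, and the guard catches the difference early.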
Every other section of the Airflow documentation makes more sense with this frame in place.
See also
- DAG authoring guide — the rules that follow from the supervisor model.
- Dependency types — how supervisors coordinate across systems.
- DAG authoring standards — what PR review checks.