The most expensive mistake teams make with Airflow is using it as a compute engine. Pandas transforms inside PythonOperator. Business logic in task callables. 5 GB data bodies flying through XCom. At scale, every one of those choices becomes an operational tax.
Airflow's job is to supervise work that happens elsewhere. The work itself runs on the system built to run it.
The five things a supervisor does
- Decides when a unit of work should run — schedule, event, asset update, or manual trigger.
- Triggers the right external system — Databricks, dbt, an API, a Lambda.
- Tracks success, failure, and retry state — the metadata database is the authoritative source of "did this run".
- Coordinates dependencies across systems — DAG A, then B, then C, where each runs on different compute.
- Emits metadata downstream — lineage, run results, SLA signals, observability events.
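The five responsibilities above reduce to a small loop: trigger, poll, record, emit. A minimal sketch in plain Python (no Airflow imports; `trigger_job`, `poll_status`, and `emit_metadata` are hypothetical stand-ins for what a provider hook and a callback would do):

```python
import time

def supervise(trigger_job, poll_status, emit_metadata, poll_interval=30):
    """Trigger work on an external system, track it to completion,
    and emit the result downstream -- the whole supervisor surface."""
    run_id = trigger_job()                    # trigger the right external system
    while True:
        status = poll_status(run_id)          # track success/failure state
        if status in ("SUCCESS", "FAILED"):
            break
        time.sleep(poll_interval)
    emit_metadata({"run_id": run_id, "status": status})  # emit metadata downstream
    return status
```

Note that the loop holds only a run ID and a status string, never the data itself; a deferrable operator replaces the sleep with an async trigger so the worker slot is released while waiting.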
That is the whole surface. Everything else should happen elsewhere.
The engine anti-pattern
Compare two implementations of the same pipeline.
The anti-pattern: work inside the worker
```python
from airflow.decorators import task

@task
def transform_orders():
    import pandas as pd

    df = pd.read_parquet("s3://bucket/orders_raw.parquet")  # 5 GB of data
    df["revenue"] = df["amount"] * df["fx_rate"]
    df["is_active"] = df["status"] == "active"
    df.to_parquet("s3://bucket/orders_enriched.parquet")
```
What is wrong:
- The worker spends 45 minutes holding 5 GB of pandas DataFrame in memory.
- The worker slot is unavailable for other tasks the whole time.
- If the worker dies mid-task, the whole 45 minutes of work is lost.
- Scaling worker memory to fit the largest pandas transform means paying for that memory on every tiny task.
- Logging, monitoring, and retries for this transform all live in Airflow instead of in the system that should own them.
The right pattern: supervise, don't execute
```python
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

transform = DatabricksRunNowOperator(
    task_id="transform_orders",
    databricks_conn_id="databricks_default",
    job_id="{{ var.value.transform_orders_job_id }}",
    deferrable=True,  # release the worker slot while Databricks runs
)
```
The 5 GB transform runs on a Databricks cluster sized for it. Airflow holds no memory. The task is deferrable, so the worker slot is free for other work while Databricks does the heavy lifting. If the task fails, Airflow retries; if the Databricks job fails, Databricks's own retry logic kicks in first.
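The two retry layers compose: the engine retries its own transient failures first, and the supervisor retries only when the whole submission fails. A hedged sketch of that layering in plain Python (the function names and retry counts are illustrative, not an Airflow API):

```python
def run_with_engine_retries(job, engine_retries=2):
    """Inner layer: the engine (e.g. Databricks) retries transient failures itself."""
    for attempt in range(engine_retries + 1):
        try:
            return job()
        except RuntimeError:             # engine-level transient failure
            if attempt == engine_retries:
                raise                    # exhausted: surface to the supervisor

def supervise_with_retries(job, task_retries=1):
    """Outer layer: the supervisor (Airflow) retries the whole submission."""
    for attempt in range(task_retries + 1):
        try:
            return run_with_engine_retries(job)
        except RuntimeError:
            if attempt == task_retries:
                raise
```

The point of the layering is ownership: the engine handles failures it understands (spot instance loss, cluster startup), and Airflow only re-submits when the engine gives up.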
Important
This distinction drives every other decision in Airflow. Deferrable operators, pools, Assets, concurrency knobs, retry backoff: all of them make sense only once you accept that Airflow is supervising work, not doing it. Teams that treat Airflow as a compute engine spend the bulk of their time fighting the scheduler. Teams that treat it as a supervisor spend their time building pipelines.
The rule of thumb
If a task takes 45 minutes and processes 5 GB of data inside the worker, you have built an ETL engine out of a scheduler. Stop; find the right engine.
Right engines, common cases:
| What the task does | Right engine |
|---|---|
| Transforms on lakehouse data | Databricks job (via DatabricksRunNowOperator) |
| SQL on a warehouse | Warehouse job or dbt model, triggered from Airflow |
| Python batch on small data | Still Airflow, but tight and short |
| ML training | Dedicated GPU compute (Modal, SageMaker, Databricks GPU cluster) |
| API call with retries | Airflow, but deferrable with an async operator |
| File processing at scale | Databricks, Glue, or a Lambda fleet; Airflow triggers and waits |
The "what goes in Airflow" test
Three questions for any logic you are considering putting in a DAG file or task callable:
- Is this decision logic (when to run, what to run, what came before)? → Airflow.
- Is this metadata emission (writing lineage, logging business signal, alerting)? → Airflow.
- Is this computational work on non-trivial data (joins, aggregations, transforms)? → Not Airflow. Trigger the system that owns that work and wait on completion.
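The three questions can be encoded as a routing helper. A sketch (the category names are an assumption for illustration, not an Airflow feature):

```python
def route(kind: str) -> str:
    """Apply the three-question test to a unit of logic.

    kind -- one of:
      'decision'  when/what to run, upstream state checks
      'metadata'  lineage, business-signal logging, alerting
      'compute'   joins, aggregations, transforms on non-trivial data
    """
    if kind in ("decision", "metadata"):
        return "airflow"
    if kind == "compute":
        return "external engine"   # trigger the owning system and wait on completion
    raise ValueError(f"unknown kind: {kind}")
```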
When in doubt, err on the side of pushing work out. Airflow's value is consistency as an orchestrator. Taking on compute responsibilities erodes that consistency.
A legitimate small exception
Not every piece of work deserves its own cluster. A 30-second Python script that touches a few hundred rows and writes a status record is fine inside a task. The rule is "one unit of recoverable work per task," not "never touch data in Python".
The heuristic: if the task runtime dominates the scheduler orchestration overhead, find the right engine. If the task is ten seconds of trivial glue, keep it in Airflow.
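The heuristic can be made concrete with a rough threshold (the ten-second figure is an illustrative assumption about scheduler overhead, not an Airflow default):

```python
def keep_in_airflow(runtime_seconds: float, overhead_seconds: float = 10.0) -> bool:
    """If the task's own runtime dominates orchestration overhead, the work
    belongs on a real engine; trivial glue stays in Airflow."""
    return runtime_seconds <= overhead_seconds
```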
What this means for DAG design
Once you internalize the supervisor model, DAG design becomes mostly shape:
- Each task triggers one external action and waits for it.
- Dependencies express order, not data flow. XComs carry metadata (run IDs, watermarks, status), not data bodies.
- Retries are transient-failure handling, not "wait longer for the data to appear". That is what sensors are for.
- Pools gate concurrency against external resources (rate-limited APIs, warehouse queue slots).
- Callbacks emit signals downstream (PagerDuty, Slack, monitoring) on failure.
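The "XComs carry metadata, not data bodies" rule above is easy to check mechanically. A sketch of a metadata-only payload with a guard against pushing data through it (the size limit is an illustrative assumption; real XCom backends have their own limits):

```python
import json

XCOM_SOFT_LIMIT_BYTES = 48 * 1024   # illustrative threshold, not an Airflow default

def make_xcom_payload(run_id: str, watermark: str, status: str) -> dict:
    """Metadata about the work, never the work's data body."""
    payload = {"run_id": run_id, "watermark": watermark, "status": status}
    if len(json.dumps(payload).encode()) > XCOM_SOFT_LIMIT_BYTES:
        raise ValueError("XCom payload looks like a data body, not metadata")
    return payload
```

A payload like `{"run_id": "run-42", "watermark": "2024-06-01", "status": "SUCCESS"}` is a few dozen bytes; a serialized DataFrame is not, and the guard catches the difference early.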
Every other section of the Airflow documentation makes more sense with this frame in place.
See also
- DAG authoring guide — the rules that follow from the supervisor model.
- Dependency types — how supervisors coordinate across systems.
- DAG authoring standards — what PR review checks.