In 2025–2026, Databricks consolidated three previously separate products under the Lakeflow umbrella. The pieces have clean boundaries once you see them, and mushing them together is the source of most "why are we building this ourselves" conversations.
| Piece | Previously | Does |
|---|---|---|
| Lakeflow Connect | (Fivetran-style) | Managed connectors from SaaS and databases into Unity Catalog |
| Lakeflow Declarative Pipelines (LDP) | Delta Live Tables | Declarative streaming/batch transformations with quality expectations and CDC |
| Lakeflow Jobs | Databricks Workflows | Native orchestration of Databricks tasks |
## Lakeflow Connect
Managed connectors for SaaS and databases: Salesforce, Workday, SQL Server CDC, Google Analytics, and a growing catalog. Think Fivetran-inside-Databricks.
Use Connect when the source is supported and you want the bytes in Unity Catalog without owning ingest infrastructure. Do not build your own Salesforce connector in Python if Connect already speaks to it.
The output lands as Delta tables in a bronze schema of your choice. From there, dbt or LDP take over.
## Lakeflow Declarative Pipelines (LDP)
Formerly Delta Live Tables. A declarative framework for streaming and incremental batch transformations with:
- Expectations: data quality constraints that warn, drop, or fail a pipeline.
- Automatic lineage: every table's upstream and downstream inferred from the pipeline code.
- Change data capture: `AUTO CDC` (replacing `APPLY CHANGES INTO`) handles SCD-1 and SCD-2 from CDC feeds.
- Automatic backfill and retry: the framework owns the plumbing.
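What `AUTO CDC` does for an SCD-1 target can be illustrated outside Databricks with a small pure-Python sketch. The helper and field names here are hypothetical; the real implementation is Delta-native and declarative, but the ordering semantics are the point:

```python
def apply_scd1(target: dict, cdc_rows: list[dict]) -> dict:
    """SCD-1 sketch: keep only each key's latest change, then upsert/delete.

    target maps business key -> current row. cdc_rows carry an op
    ('upsert' or 'delete'), a business key, and a monotonically
    increasing sequence number.
    """
    latest: dict = {}
    for row in cdc_rows:
        key = row["customer_id"]
        # Keep the highest sequence number per key, mirroring
        # AUTO CDC's SEQUENCE BY out-of-order handling.
        if key not in latest or row["seq"] > latest[key]["seq"]:
            latest[key] = row
    for key, row in latest.items():
        if row["op"] == "delete":
            target.pop(key, None)
        else:
            target[key] = {k: v for k, v in row.items() if k not in ("op", "seq")}
    return target
```

SCD-2 differs only in that superseded rows are end-dated rather than overwritten; the framework owns that bookkeeping too.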
The model is SQL or Python:
```python
import dlt

@dlt.table(
    comment="Customer segments, behavioral",
    table_properties={"quality": "silver"},
)
@dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")
@dlt.expect("reasonable_ltv", "ltv_usd BETWEEN 0 AND 1000000")
def stg_customer_segments():
    return (
        spark.readStream.table("prod.bronze.customer_segments_raw")
        .select("customer_id", "segment_id", "ltv_usd", "_loaded_at")
    )
```
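The two expectation flavors in the snippet differ only in what happens to violating rows. A minimal sketch of the semantics on plain dicts (a hypothetical helper, not the dlt API):

```python
def apply_expectations(rows, drop_rules, warn_rules):
    """Sketch of LDP expectation semantics.

    drop_rules: name -> predicate; violating rows are dropped
      (like dlt.expect_or_drop).
    warn_rules: name -> predicate; violations are only counted
      (like dlt.expect).
    Returns (kept_rows, violation_counts).
    """
    counts = {name: 0 for name in list(drop_rules) + list(warn_rules)}
    kept = []
    for row in rows:
        dropped = False
        for name, pred in drop_rules.items():
            if not pred(row):
                counts[name] += 1
                dropped = True
        for name, pred in warn_rules.items():
            if not pred(row):
                counts[name] += 1
        if not dropped:
            kept.append(row)
    return kept, counts
```

A third mode, `expect_or_fail`, aborts the pipeline on the first violation instead of counting or dropping.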
### When to reach for LDP
- Streaming or near-real-time silver tables with quality gates enforced at write time.
- CDC ingestion (SCD-1 or SCD-2) from a Delta change feed or a CDC source.
- Pipelines where you want Databricks to own the incrementalization rather than coding `is_incremental()` by hand in dbt.
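The hand-rolled alternative that the last bullet refers to is, in spirit, a high-water-mark filter you maintain yourself. A pure-Python sketch with hypothetical names (in dbt this is the `is_incremental()` block wrapped around a timestamp predicate):

```python
def incremental_batch(source_rows, high_water_mark):
    """Select only rows newer than the last processed timestamp,
    then advance the mark -- the bookkeeping LDP does for you."""
    new_rows = [r for r in source_rows if r["_loaded_at"] > high_water_mark]
    new_mark = max((r["_loaded_at"] for r in new_rows), default=high_water_mark)
    return new_rows, new_mark
```

The hard parts in production are exactly what this sketch omits: late-arriving data, backfills, and schema changes, all of which LDP handles for you.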
### 2026 notes

Four things shifted in the last year that matter:

- `AUTO CDC` has replaced `APPLY CHANGES INTO`. New pipelines should use it.
- Queued execution mode serializes concurrent triggers cleanly; you no longer hit conflict errors from overlapping runs.
- Type widening lets you broaden column types without a full pipeline reset.
- Pipeline configs can live as Unity Catalog table properties, unifying governance between pipeline metadata and table metadata.
### LDP vs. dbt
Both exist on Databricks. They are not competitors.
| LDP is right when | dbt is right when |
|---|---|
| Streaming / near-real-time matters | Batch SQL transformations |
| You want expectations enforced at write time | Your team lives in SQL and wants macros/packages |
| CDC is a first-class concern | You need the manifest for Slim CI and Cosmos |
| Incrementalization should be Databricks' problem | You want full control of the incremental logic |
The pattern most mature Causeway teams land on: LDP owns silver streaming tables with quality gates; dbt owns gold marts. See the dbt-on-Databricks quickstart for the dbt half.
## Lakeflow Jobs
The native orchestrator. Formerly Databricks Workflows. A Lakeflow Job is one or more tasks with dependencies, schedules, retries, SLAs, and alerts.
### When Lakeflow Jobs is the right orchestrator
- Everything you orchestrate is on Databricks. No cross-platform DAGs.
- You want the best UX for retries and lineage within the platform.
- You do not want to pay for a separate orchestration control plane.
### When Airflow is the right orchestrator
- Your DAG crosses platforms. Databricks after a Snowflake sync, after an SFTP drop, after a Salesforce export, after a dbt-on-Redshift build. Airflow is a cross-platform workflow engine; Lakeflow is a data-first engine.
- You already have a mature Airflow estate you do not want to split.
### The 2026 canonical pattern
Lakeflow Jobs and Airflow coexist when the graph crosses platforms:
- DAB owns the job definition. The job is declared in YAML, deployed by `bundle deploy`.
- Airflow triggers by job ID via `DatabricksRunNowOperator`. It does not redefine the job.
- Both live in Git.
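The bundle side of that split might look like the following. This is a hedged sketch of a Databricks Asset Bundle resource file; the resource key, job name, and task are illustrative, so check the DAB schema for your CLI version:

```yaml
# resources/silver_pipeline_job.yml -- deployed with `databricks bundle deploy`
resources:
  jobs:
    silver_pipeline:            # illustrative resource key
      name: silver-pipeline
      tasks:
        - task_key: refresh_silver
          notebook_task:
            notebook_path: ../src/refresh_silver.py
```

Airflow then needs only the deployed job's ID, which keeps the definition in one place.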
```python
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

run = DatabricksRunNowOperator(
    task_id="run_silver_pipeline",
    databricks_conn_id="databricks_default",
    job_id="{{ var.value.silver_pipeline_job_id }}",
)
```
> **Warning:** `DatabricksSubmitRunOperator` submits a JSON job spec at trigger time. It exists for one-off cases; avoid it in production. Using it means you have two sources of truth for the job definition: the bundle in Git and the JSON in the Airflow DAG. They will drift.
## Putting it together
A typical Causeway pipeline stitches all three:
```text
Salesforce → [Lakeflow Connect] → prod.bronze.sf_accounts
                   ↓
[Lakeflow Declarative Pipeline]
    stg_accounts (streaming table + expectations)
    int_accounts_joined (materialized view)
                   ↓
[dbt] prod.gold.dim_customers
                   ↓
[Lakeflow Jobs]
    schedule → refresh BI dashboards
                   ↓
[Airflow]
    sequence with upstream dbt-on-Redshift run
```
Each tool does the part it is best at. None of them overlaps with another.
## See also
- Asset Bundles guide — the deployment unit for jobs and pipelines.
- Declarative Pipelines guide — building an LDP.
- dbt + Databricks quickstart — the dbt half of the pattern.