A materialization is how dbt physically persists your model. dbt-databricks supports five: `view`, `table`, `incremental`, `materialized_view`, and `streaming_table`. dbt's `ephemeral` is not a true materialization, but it competes for the same decision, so it appears below as a sixth option. Each has a sweet spot; mismatches waste money and confuse debugging.
The six options
| Materialization | What it is | Cost profile |
|---|---|---|
| `view` | A SQL view; computed on read | Zero storage; query-time cost |
| `table` | A full table rebuild every run | Write cost every run; cheap reads |
| `incremental` | Merge/append/overwrite of new rows only | Low write per run after first; complex |
| `materialized_view` | Databricks-managed MV (delegated to DLT) | Databricks decides refresh cadence |
| `streaming_table` | Reads a Delta change feed or stream | Continuous; latency-over-cost |
| `ephemeral` | Not a materialization; a CTE inlined into callers | No storage; inflates downstream SQL |
Decision framework
Walk the questions in order. Stop at the first "yes".
- Is the model 1:1 with a source table and only renames/casts? → `view` (staging default).
- Is the model a small reusable building block not exposed downstream? → `ephemeral` or `view` (intermediate default).
- Does the full-refresh cost hurt and the source has a reliable watermark? → `incremental`.
- Is the model an aggregation of a large source that Databricks could maintain for you? → `materialized_view`.
- Does latency matter more than cost, and is there a streaming source? → `streaming_table`.
- Otherwise → `table`.
Defaults per layer
```yaml
# dbt_project.yml
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
```
Override per-model with a `config()` block when the rule of thumb does not fit.
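For instance, a heavy intermediate that the layer default would make ephemeral can be pinned to a table in its own file. A minimal sketch, with model and column names that are illustrative, not from any real project:

```sql
-- models/intermediate/int_daily_spend.sql (hypothetical model)
{{ config(materialized='table') }}

select
    customer_id,
    order_date,
    sum(amount) as daily_spend
from {{ ref('stg_orders') }}
group by customer_id, order_date
```

The in-file `config()` wins over the `dbt_project.yml` default for that one model only.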
view — the default you should reach for
Views cost nothing to maintain. They always reflect the latest source data. They are the right default for staging and most intermediates.
Views stop being right when:
- The SQL is expensive and gets run many times per day (push to `table`).
- The source is enormous and every query scans the whole thing (push to `incremental` if there is a watermark, or to `materialized_view` if it is an aggregation).
table — when writes are cheaper than the reads would be
A table materialization does `CREATE OR REPLACE TABLE ... AS ...` every run. You pay the full rewrite cost on the schedule, in exchange for cheap reads the rest of the time.
`table` is the right mart default when:
- The model fits the warehouse in a reasonable time (say, under ten minutes).
- Downstream reads are more frequent than writes.
Move to `incremental` when the full-refresh cost becomes painful.
incremental — the one that earns its complexity
Incremental models merge, append, or overwrite only the new data. They are the single largest source of pain in a production dbt project. They are also the single largest source of cost savings when the source is big.
Read the dedicated incremental model guide before building your first one. The failure modes (drift, schema change, unique-key violations) are shape-specific.
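For orientation only, a merge-style incremental with a watermark filter tends to look like the sketch below. All model and column names are assumptions for illustration:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id'
) }}

select
    event_id,
    user_id,
    event_type,
    event_ts
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- on incremental runs, pull only rows past the current high-water mark
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

The `is_incremental()` branch is what makes the second and later runs cheap; the first run (and any `--full-refresh`) executes the query without the filter.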
materialized_view — hand the refresh logic to Databricks
materialized_view delegates to Delta Live Tables under the hood. Databricks owns the incrementalization: no `is_incremental()`, no `unique_key` debate, no `on_schema_change` config. You write the aggregation query; DLT figures out what to change.
The trade-off:
- You get less control over when and how the refresh happens.
- Some operations (complex joins, certain window functions) are not supported inside an MV.
Good default for aggregation-heavy gold tables where "eventually consistent, roughly hourly" is fine.
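A sketch of the shape, with illustrative names: the model is just the aggregation query, with no incremental logic of its own.

```sql
{{ config(materialized='materialized_view') }}

-- DLT decides how and when to keep this aggregate fresh
select
    order_date,
    count(*)    as order_count,
    sum(amount) as gross_revenue
from {{ ref('stg_orders') }}
group by order_date
```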
streaming_table — when latency beats cost
`streaming_table` reads from a Delta change feed or a streaming source (Kafka, Kinesis). Use it when:
- Consumers need data within seconds, not hours.
- The source produces a change feed you can subscribe to.
- You are willing to pay for continuous compute.
Not appropriate for dashboards. Very appropriate for operational surfaces (fraud detection, live product counts).
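As a rough sketch, assuming a Delta source read with the `STREAM` keyword (the source name is illustrative; check the dbt-databricks docs for the exact syntax your runtime supports):

```sql
{{ config(materialized='streaming_table') }}

select
    txn_id,
    amount,
    txn_ts
from stream({{ ref('raw_transactions') }})
```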
ephemeral — useful, sometimes
An ephemeral model is not materialized at all. dbt inlines it into the callers' compiled SQL as a CTE.
Pros:
- No storage cost.
- Keeps intermediate logic DRY.
Cons:
- Does not appear as its own step in the Databricks query history. If the compiled SQL is slow, you cannot isolate which ephemeral is to blame.
- Inflates compiled SQL length when referenced by many models.
Rule of thumb: `ephemeral` for intermediates used by one or two downstream models, `view` when used by more.
Picking a strategy for incremental
Five strategies, all Delta-only unless noted:
| Strategy | Use when |
|---|---|
| `merge` (default) | Updates happen to existing rows |
| `append` | Events are immutable (pure inserts) |
| `insert_overwrite` | You rewrite whole partitions, like "today" |
| `replace_where` | You rewrite a predicate-scoped slice |
| `microbatch` (1.9+) | Late-arriving events in a time-series fact |
Start with `merge` if you have updates, `append` if you do not. Move to `insert_overwrite` or `replace_where` if the merge cost hurts and the rewrite boundary is clean.
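For example, an insert_overwrite model that rewrites only the current day's partition might be sketched as follows (the names and the date boundary are assumptions):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by=['event_date']
) }}

select
    event_id,
    event_date,
    payload
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- replace only the partitions this query touches (today's, given the filter)
  where event_date >= current_date()
{% endif %}
```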
Physical layer: clustering
On Databricks, every table and incremental should use liquid clustering unless you have a specific reason to partition.
```sql
{{ config(
    materialized='table',
    liquid_clustered_by=['customer_id', 'event_date']
) }}
```
Liquid clustering adapts as query patterns shift and eliminates the "wrong partition key" mistake. Fall back to `partition_by` only when you need hard isolation (GDPR region boundaries, for instance) or you are on a runtime too old for clustering.
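When you do need the fallback, the config swaps one key for the other; a sketch with an illustrative column name:

```sql
{{ config(
    materialized='table',
    partition_by=['gdpr_region']
) }}
```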
See also
- Incremental models guide — building your first one.
- Materializations reference — exhaustive config options per materialization.
- Model authoring standards — Causeway's rules for picking materializations.